From: Erik J. <er...@sg...> - 2005-09-24 01:13:41
|
> This looks pretty cool, and to show how it's useful you could convert > a few of the fork/exec calls to less important subsystems (e.g. keys) > to this. > > I don't think rwsem locking is apropinquate for what you're doing. We need > to keep the overhead as low as possible in fork/exec/exit, and given how > infrequent modifications are this looks like a prime candidate for RCU. This is a test patch / proof of concept. I don't understand all implications of changes to the keyring subsystem. I know this boots and /proc/keys looks like it did before my changes. I'm open to learning enough about keyrings to test this patch further if it's requested. Let me know. I do know that the exec_keys function as well as key_fsuid_changed and key_fsgid_changed happen a lot and these are all paths that need to retrieve data from the subscriber list in pnotify now. The callout locations of pnotify were not exactly the same as the callout locations used by the key subsystem. This version still uses an rwsem lock for pnotify_subscriber_list_sem in each task. I'll look in to an RCU version next week and re-post. Because the keyring subsystem isn't a kernel module, a simple pnotify_subscribe is all we need in the key_init() function to get the ball rolling. At the time that key_init() is run, the PID is zero. pnotify handles making sure the kernel module is subscribed to the children automatically. If this were a kernel module and we wanted to be sure to be notified for all tasks, we could use the 'init' event to subscribe us to all tasks in the system at registration time. This is unnecessary in this example. include/linux/key.h | 21 +++++ include/linux/sched.h | 4 - kernel/exit.c | 1 kernel/fork.c | 6 - security/keys/key.c | 22 ++++++ security/keys/keyctl.c | 28 ++++++- security/keys/process_keys.c | 152 ++++++++++++++++++++++++++++++++++++------- security/keys/request_key.c | 31 +++++++- 8 files changed, 222 insertions(+), 43 deletions(-) Index: linux/include/linux/key.h =================================================================== --- linux.orig/include/linux/key.h 2005-09-23 13:59:05.716416954 -0500 +++ linux/include/linux/key.h 2005-09-23 15:28:01.249594001 -0500 @@ -19,6 +19,7 @@ #include <linux/list.h> #include <linux/rbtree.h> #include <linux/rcupdate.h> +#include <linux/pnotify.h> #include <asm/atomic.h> #ifdef __KERNEL__ @@ -262,9 +263,9 @@ extern struct key root_user_keyring, root_session_keyring; extern int alloc_uid_keyring(struct user_struct *user); extern void switch_uid_keyring(struct user_struct *new_user); -extern int copy_keys(unsigned long clone_flags, struct task_struct *tsk); +extern int copy_keys(struct task_struct *tsk, struct pnotify_subscriber *sub, void *olddata); extern int copy_thread_group_keys(struct task_struct *tsk); -extern void exit_keys(struct task_struct *tsk); +extern void exit_keys(struct task_struct *task, struct pnotify_subscriber *sub); extern void exit_thread_group_keys(struct signal_struct *tg); extern int suid_keys(struct task_struct *tsk); extern int exec_keys(struct task_struct *tsk); @@ -279,6 +280,22 @@ old_session; \ }) +/* pnotify subscriber service request */ +static struct pnotify_events key_events = { + .module = NULL, + .name = "key", + .data = NULL, + .entry = LIST_HEAD_INIT(key_events.entry), + .fork = copy_keys, + .exit = exit_keys, +}; + +/* key info associated with the task struct and managed by pnotify */ +struct key_task { + struct key *thread_keyring; /* keyring private to this thread */ + unsigned char jit_keyring; /* default keyring to attach requested keys to */ +}; + #else /* CONFIG_KEYS */ #define key_validate(k) 0 Index: linux/kernel/exit.c =================================================================== --- linux.orig/kernel/exit.c 2005-09-23 13:59:05.753522701 -0500 +++ linux/kernel/exit.c 2005-09-23 14:00:28.856701430 -0500 @@ -843,7 +843,6 @@ exit_namespace(tsk); exit_thread(); cpuset_exit(tsk); - exit_keys(tsk); if (group_dead && tsk->signal->leader) disassociate_ctty(1); Index: linux/kernel/fork.c =================================================================== --- linux.orig/kernel/fork.c 2005-09-23 13:59:05.767193239 -0500 +++ linux/kernel/fork.c 2005-09-23 14:00:28.868419028 -0500 @@ -979,10 +979,8 @@ goto bad_fork_cleanup_sighand; if ((retval = copy_mm(clone_flags, p))) goto bad_fork_cleanup_signal; - if ((retval = copy_keys(clone_flags, p))) - goto bad_fork_cleanup_mm; if ((retval = copy_namespace(clone_flags, p))) - goto bad_fork_cleanup_keys; + goto bad_fork_cleanup_mm; retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs); if (retval) goto bad_fork_cleanup_namespace; @@ -1138,8 +1136,6 @@ bad_fork_cleanup_namespace: pnotify_exit(p); exit_namespace(p); -bad_fork_cleanup_keys: - exit_keys(p); bad_fork_cleanup_mm: if (p->mm) mmput(p->mm); Index: linux/security/keys/key.c =================================================================== --- linux.orig/security/keys/key.c 2005-09-23 13:59:05.775004975 -0500 +++ linux/security/keys/key.c 2005-09-23 17:57:36.519925909 -0500 @@ -15,6 +15,7 @@ #include <linux/slab.h> #include <linux/workqueue.h> #include <linux/err.h> +#include <linux/pnotify.h> #include "internal.h" static kmem_cache_t *key_jar; @@ -1009,6 +1010,9 @@ */ void __init key_init(void) { + struct key_task *kt; + struct pnotify_subscriber *sub; + /* allocate a slab in which we can store keys */ key_jar = kmem_cache_create("key_jar", sizeof(struct key), 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL); @@ -1039,4 +1043,22 @@ /* link the two root keyrings together */ key_link(&root_session_keyring, &root_user_keyring); + /* Allocate memory for task assocated key_task structure */ + kt = (struct key_task *)kmalloc(sizeof(struct key_task),GFP_KERNEL); + if (!kt) { + printk(KERN_ERR "Insufficient memory to allocate key_task structure" + " in key_init function.\n"); + return; + } + kt->thread_keyring = NULL; + + /* subscribe this kernel entity to the subscriber list for current task */ + sub = pnotify_subscribe(current, &key_events); + if (!sub) { + printk(KERN_ERR "Insufficient memory to add to subscriber list structure" + " in key_init function.\n"); + } + /* Associate the kt structure with this task via pnotify subscriber */ + sub->data = (void *)kt; + } /* end key_init() */ Index: linux/security/keys/process_keys.c =================================================================== --- linux.orig/security/keys/process_keys.c 2005-09-23 13:59:05.783793178 -0500 +++ linux/security/keys/process_keys.c 2005-09-23 20:02:51.419625275 -0500 @@ -16,6 +16,7 @@ #include <linux/keyctl.h> #include <linux/fs.h> #include <linux/err.h> +#include <linux/pnotify.h> #include <asm/uaccess.h> #include "internal.h" @@ -137,6 +138,8 @@ int install_thread_keyring(struct task_struct *tsk) { struct key *keyring, *old; + struct key_task *kt; + struct pnotify_subscriber *sub; char buf[20]; int ret; @@ -149,9 +152,21 @@ } task_lock(tsk); - old = tsk->thread_keyring; - tsk->thread_keyring = keyring; + down_write(&tsk->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(tsk, key_events.name); + if (sub == NULL || sub->data == NULL) { /* shouldn't happen */ + printk(KERN_ERR "install_thread_keyring pnotify subscriber or data ptr null, task: %d\n", tsk->pid); + task_unlock(tsk); + up_write(&tsk->pnotify_subscriber_list_sem); + ret = PTR_ERR(sub); + goto error; + } + kt = (struct key_task *)sub->data; + + old = kt->thread_keyring; + kt->thread_keyring = keyring; task_unlock(tsk); + up_write(&tsk->pnotify_subscriber_list_sem); ret = 0; @@ -267,13 +282,25 @@ /* * copy the keys for fork */ -int copy_keys(unsigned long clone_flags, struct task_struct *tsk) +int copy_keys(struct task_struct *tsk, struct pnotify_subscriber *sub, void *olddata) { - key_check(tsk->thread_keyring); + struct key_task *kt = ((struct key_task *)(sub->data)); + + /* Allocate memory for task-associated key_task structure */ + kt = (struct key_task *)kmalloc(sizeof(struct key_task),GFP_KERNEL); + if (!kt) { + printk(KERN_ERR "Insufficient memory to allocate key_task structure" + " in copy_keys function. Task was: %d", tsk->pid); + return PNOTIFY_ERROR; + } + /* Associate key_task structure with the new child via pnotify subscriber */ + sub->data = (void *)kt; + + key_check(kt->thread_keyring); /* no thread keyring yet */ - tsk->thread_keyring = NULL; - return 0; + kt->thread_keyring = NULL; + return PNOTIFY_OK; } /* end copy_keys() */ @@ -292,9 +319,16 @@ /* * dispose of keys upon thread exit */ -void exit_keys(struct task_struct *tsk) +void exit_keys(struct task_struct *task, struct pnotify_subscriber *sub) { - key_put(tsk->thread_keyring); + struct key_task *kt = ((struct key_task *)(sub->data)); + if (kt == NULL) { /* shouldn't ever happen */ + printk(KERN_ERR "exit_keys pnotify subscriber data ptr null, task: %d\n", task->pid); + return; + } + key_put(kt->thread_keyring); + kfree(kt); /* Free pnotify subscriber data for this task */ + sub->data = NULL; } /* end exit_keys() */ @@ -306,12 +340,28 @@ { unsigned long flags; struct key *old; + struct key_task *kt; + struct pnotify_subscriber *sub; - /* newly exec'd tasks don't get a thread keyring */ task_lock(tsk); - old = tsk->thread_keyring; - tsk->thread_keyring = NULL; + /* pnotify doesn't have a compute_creds event at this time, so we + * need to retrieve the data */ + + down_write(&tsk->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(tsk, key_events.name); + if (sub == NULL || sub->data == NULL) { /* shouldn't happen */ + printk(KERN_ERR "exec_keys pnotify subscriber or data ptr null, task: %d\n", tsk->pid); + task_unlock(tsk); + up_write(&tsk->pnotify_subscriber_list_sem); + return PNOTIFY_OK; /* key structures not populated yet */ + } + kt = (struct key_task *)sub->data; + + /* newly exec'd tasks don't get a thread keyring */ + old = kt->thread_keyring; + kt->thread_keyring = NULL; task_unlock(tsk); + up_write(&tsk->pnotify_subscriber_list_sem); key_put(old); @@ -344,12 +394,26 @@ */ void key_fsuid_changed(struct task_struct *tsk) { + struct key_task *kt; + struct pnotify_subscriber *sub; + + /* no pnotify event for this, so we need to grab the data */ + down_write(&tsk->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(tsk, key_events.name); + if (sub == NULL || sub->data == NULL) { /* shouldn't happen */ + printk(KERN_ERR "key_fsuid_changed pnotify subscriber or data ptr null, task: %d\n", tsk->pid); + up_write(&tsk->pnotify_subscriber_list_sem); + return; + } + kt = (struct key_task *)sub->data; + /* update the ownership of the thread keyring */ - if (tsk->thread_keyring) { - down_write(&tsk->thread_keyring->sem); - tsk->thread_keyring->uid = tsk->fsuid; - up_write(&tsk->thread_keyring->sem); + if (kt->thread_keyring) { + down_write(&kt->thread_keyring->sem); + kt->thread_keyring->uid = tsk->fsuid; + up_write(&kt->thread_keyring->sem); } + up_write(&tsk->pnotify_subscriber_list_sem); } /* end key_fsuid_changed() */ @@ -359,12 +423,26 @@ */ void key_fsgid_changed(struct task_struct *tsk) { + struct key_task *kt; + struct pnotify_subscriber *sub; + + /* pnotify doesn't have an event for this, so we need to grab the data */ + down_write(&tsk->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(tsk, key_events.name); + if (sub == NULL || sub->data == NULL) { /* shouldn't happen */ + printk(KERN_ERR "key_fsgid_changed pnotify subscriber or data ptr was null, task: %d\n", tsk->pid); + up_write(&tsk->pnotify_subscriber_list_sem); + return; + } + kt = (struct key_task *)sub->data; + /* update the ownership of the thread keyring */ - if (tsk->thread_keyring) { - down_write(&tsk->thread_keyring->sem); - tsk->thread_keyring->gid = tsk->fsgid; - up_write(&tsk->thread_keyring->sem); + if (kt->thread_keyring) { + down_write(&kt->thread_keyring->sem); + kt->thread_keyring->gid = tsk->fsgid; + up_write(&kt->thread_keyring->sem); } + up_write(&tsk->pnotify_subscriber_list_sem); } /* end key_fsgid_changed() */ @@ -383,6 +461,8 @@ { struct request_key_auth *rka; struct key *key, *ret, *err, *instkey; + struct pnotify_subscriber *sub; + struct key_task *kt; /* we want to return -EAGAIN or -ENOKEY if any of the keyrings were * searchable, but we failed to find a key or we found a negative key; @@ -395,12 +475,23 @@ ret = NULL; err = ERR_PTR(-EAGAIN); + down_write(&context->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(context, key_events.name); + if (sub == NULL || sub->data == NULL) { + printk(KERN_ERR "search_process_keyrings pnotify subscriber or data ptr null, task: %d\n", context->pid); + up_write(&context->pnotify_subscriber_list_sem); + return (struct key *)-EFAULT; + } + kt = (struct key_task *)sub->data; + /* search the thread keyring first */ - if (context->thread_keyring) { - key = keyring_search_aux(context->thread_keyring, + if (kt->thread_keyring) { + key = keyring_search_aux(kt->thread_keyring, context, type, description, match); - if (!IS_ERR(key)) + if (!IS_ERR(key)) { + up_write(&context->pnotify_subscriber_list_sem); goto found; + } switch (PTR_ERR(key)) { case -EAGAIN: /* no key */ @@ -414,6 +505,7 @@ break; } } + up_write(&context->pnotify_subscriber_list_sem); /* search the process keyring second */ if (context->signal->process_keyring) { @@ -535,15 +627,26 @@ { struct key *key; int ret; + struct pnotify_subscriber *sub; + struct key_task *kt; if (!context) context = current; key = ERR_PTR(-ENOKEY); + down_write(&context->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(context, key_events.name); + if (sub == NULL || sub->data == NULL) { /* shouldn't happen */ + printk(KERN_ERR "search_process_keyrings pnotify subscriber or data ptr null, task: %d\n", context->pid); + up_write(&context->pnotify_subscriber_list_sem); + return (struct key *)-EFAULT; + } + kt = (struct key_task *)sub->data; + switch (id) { case KEY_SPEC_THREAD_KEYRING: - if (!context->thread_keyring) { + if (!kt->thread_keyring) { if (!create) goto error; @@ -554,7 +657,7 @@ } } - key = context->thread_keyring; + key = kt->thread_keyring; atomic_inc(&key->usage); break; @@ -634,6 +737,7 @@ goto invalid_key; error: + up_write(&context->pnotify_subscriber_list_sem); return key; invalid_key: Index: linux/include/linux/sched.h =================================================================== --- linux.orig/include/linux/sched.h 2005-09-23 12:19:36.300574288 -0500 +++ linux/include/linux/sched.h 2005-09-23 18:22:52.054928285 -0500 @@ -687,10 +687,6 @@ kernel_cap_t cap_effective, cap_inheritable, cap_permitted; unsigned keep_capabilities:1; struct user_struct *user; -#ifdef CONFIG_KEYS - struct key *thread_keyring; /* keyring private to this thread */ - unsigned char jit_keyring; /* default keyring to attach requested keys to */ -#endif int oomkilladj; /* OOM kill score adjustment (bit shift). */ char comm[TASK_COMM_LEN]; /* executable name excluding path - access with [gs]et_task_comm (which lock Index: linux/security/keys/keyctl.c =================================================================== --- linux.orig/security/keys/keyctl.c 2005-09-23 12:19:15.785063745 -0500 +++ linux/security/keys/keyctl.c 2005-09-23 18:45:09.038763524 -0500 @@ -928,31 +928,51 @@ long keyctl_set_reqkey_keyring(int reqkey_defl) { int ret; + unsigned char jit_return; + struct pnotify_subscriber *sub; + struct key_task *kt; + + down_write(¤t->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(current, key_events.name); + if (sub == NULL || sub->data == NULL) { /* shouldn't happen */ + printk(KERN_ERR "keyctl_set_reqkey_keyring pnotify subscriber or data ptr null, task: %d\n", current->pid); + up_write(¤t->pnotify_subscriber_list_sem); + return -EFAULT; + } + kt = (struct key_task *)sub->data; switch (reqkey_defl) { case KEY_REQKEY_DEFL_THREAD_KEYRING: ret = install_thread_keyring(current); - if (ret < 0) + if (ret < 0) { + up_write(¤t->pnotify_subscriber_list_sem); return ret; + } goto set; case KEY_REQKEY_DEFL_PROCESS_KEYRING: ret = install_process_keyring(current); - if (ret < 0) + if (ret < 0) { + up_write(¤t->pnotify_subscriber_list_sem); return ret; + } case KEY_REQKEY_DEFL_DEFAULT: case KEY_REQKEY_DEFL_SESSION_KEYRING: case KEY_REQKEY_DEFL_USER_KEYRING: case KEY_REQKEY_DEFL_USER_SESSION_KEYRING: set: - current->jit_keyring = reqkey_defl; + + kt->jit_keyring = reqkey_defl; case KEY_REQKEY_DEFL_NO_CHANGE: - return current->jit_keyring; + jit_return = kt->jit_keyring; + up_write(¤t->pnotify_subscriber_list_sem); + return jit_return; case KEY_REQKEY_DEFL_GROUP_KEYRING: default: + up_write(¤t->pnotify_subscriber_list_sem); return -EINVAL; } Index: linux/security/keys/request_key.c =================================================================== --- linux.orig/security/keys/request_key.c 2005-09-23 12:19:16.032109161 -0500 +++ linux/security/keys/request_key.c 2005-09-23 20:03:30.675442510 -0500 @@ -14,6 +14,7 @@ #include <linux/kmod.h> #include <linux/err.h> #include <linux/keyctl.h> +#include <linux/pnotify.h> #include "internal.h" struct key_construction { @@ -39,6 +40,17 @@ char *argv[10], *envp[3], uid_str[12], gid_str[12]; char key_str[12], keyring_str[3][12]; int ret, i; + struct pnotify_subscriber *sub; + struct key_task *kt; + + down_write(&tsk->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(current, key_events.name); + if (sub == NULL || sub->data == NULL) { /* shouldn't happen */ + printk(KERN_ERR "call_request_key pnotify subscriber or data ptr null, task: %d\n", tsk->pid); + up_write(&tsk->pnotify_subscriber_list_sem); + return -EFAULT; + } + kt = (struct key_task *)sub->data; kenter("{%d},%s,%s", key->serial, op, callout_info); @@ -58,7 +70,7 @@ /* we specify the process's default keyrings */ sprintf(keyring_str[0], "%d", - tsk->thread_keyring ? tsk->thread_keyring->serial : 0); + kt->thread_keyring ? kt->thread_keyring->serial : 0); prkey = 0; if (tsk->signal->process_keyring) @@ -105,6 +117,7 @@ key_put(session_keyring); error: + up_write(&tsk->pnotify_subscriber_list_sem); kleave(" = %d", ret); return ret; @@ -300,15 +313,26 @@ { struct task_struct *tsk = current; struct key *drop = NULL; + struct pnotify_subscriber *sub; + struct key_task *kt; + + down_write(&tsk->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(current, key_events.name); + if (sub == NULL || sub->data == NULL) { /* shouldn't happen */ + printk(KERN_ERR "request_key_link pnotify subscriber or data ptr null, task: %d\n", tsk->pid); + up_write(&tsk->pnotify_subscriber_list_sem); + return; + } + kt = (struct key_task *)sub->data; kenter("{%d},%p", key->serial, dest_keyring); /* find the appropriate keyring */ if (!dest_keyring) { - switch (tsk->jit_keyring) { + switch (kt->jit_keyring) { case KEY_REQKEY_DEFL_DEFAULT: case KEY_REQKEY_DEFL_THREAD_KEYRING: - dest_keyring = tsk->thread_keyring; + dest_keyring = kt->thread_keyring; if (dest_keyring) break; @@ -347,6 +371,7 @@ key_put(drop); kleave(""); + down_write(&tsk->pnotify_subscriber_list_sem); } /* end request_key_link() */ |
From: Erik J. <er...@sg...> - 2005-09-29 16:53:52
|
Here is my frist pass at a version of pnotify with RCU. Later, I'll post some performance data on this patch with the pnotify-RCU-aware job and keyring support (separate posts). I did receive some feedback on mostly style issues that aren't relevant to the performance discussion. I'll fix those later. My feeling is we shouldn't use this RCU-protected-subscriber-list version of pnotify - see my performance post to follow. It includes notes about sleeping while rcu_read_lock is held, for exmaple. Posts to follow: RCU-pnotify aware version of Job RCU-pnotify aware version of keyring support Performance comparisons Documentation/pnotify.txt | 369 ++++++++++++++++++++++++++++++ fs/exec.c | 2 include/linux/init_task.h | 2 include/linux/pnotify.h | 265 ++++++++++++++++++++++ include/linux/sched.h | 5 init/Kconfig | 8 kernel/Makefile | 1 kernel/exit.c | 4 kernel/fork.c | 14 + kernel/pnotify.c | 554 ++++++++++++++++++++++++++++++++++++++++++++++ 10 files changed, 1224 insertions(+) Index: linux/fs/exec.c =================================================================== --- linux.orig/fs/exec.c 2005-09-19 22:00:41.000000000 -0500 +++ linux/fs/exec.c 2005-09-27 10:43:32.271608463 -0500 @@ -48,6 +48,7 @@ #include <linux/syscalls.h> #include <linux/rmap.h> #include <linux/acct.h> +#include <linux/pnotify.h> #include <asm/uaccess.h> #include <asm/mmu_context.h> @@ -1203,6 +1204,7 @@ retval = search_binary_handler(bprm,regs); if (retval >= 0) { free_arg_pages(bprm); + pnotify_exec(current); /* execve success */ security_bprm_free(bprm); Index: linux/include/linux/init_task.h =================================================================== --- linux.orig/include/linux/init_task.h 2005-09-19 22:00:41.000000000 -0500 +++ linux/include/linux/init_task.h 2005-09-27 10:43:32.304808221 -0500 @@ -2,6 +2,7 @@ #define _LINUX__INIT_TASK_H #include <linux/file.h> +#include <linux/pnotify.h> #include <linux/rcupdate.h> #define INIT_FDTABLE \ @@ -121,6 +122,7 @@ .proc_lock = SPIN_LOCK_UNLOCKED, \ .journal_info = NULL, \ .cpu_timers = INIT_CPU_TIMERS(tsk.cpu_timers), \ + INIT_TASK_PNOTIFY(tsk) \ .fs_excl = ATOMIC_INIT(0), \ } Index: linux/include/linux/sched.h =================================================================== --- linux.orig/include/linux/sched.h 2005-09-19 22:00:41.000000000 -0500 +++ linux/include/linux/sched.h 2005-09-27 17:46:55.432769228 -0500 @@ -795,6 +795,11 @@ struct mempolicy *mempolicy; short il_next; #endif +#ifdef CONFIG_PNOTIFY +/* List of pnotify kernel module subscribers */ + struct list_head pnotify_subscriber_list; + struct rw_semaphore pnotify_subscriber_list_sem; +#endif #ifdef CONFIG_CPUSETS struct cpuset *cpuset; nodemask_t mems_allowed; Index: linux/init/Kconfig =================================================================== --- linux.orig/init/Kconfig 2005-09-19 22:00:41.000000000 -0500 +++ linux/init/Kconfig 2005-09-27 17:46:52.034674237 -0500 @@ -162,6 +162,14 @@ for processing it. A preliminary version of these tools is available at <http://www.physik3.uni-rostock.de/tim/kernel/utils/acct/>. +config PNOTIFY + bool "Support for Process Notification" + help + Say Y here if you will be loading modules which provide support + for process notification. Examples of such modules include the + Linux Jobs module and the Linux Array Sessions module. If you will not + be using such modules, say N. + config SYSCTL bool "Sysctl support" ---help--- Index: linux/kernel/Makefile =================================================================== --- linux.orig/kernel/Makefile 2005-09-19 22:00:41.000000000 -0500 +++ linux/kernel/Makefile 2005-09-27 17:46:52.056156447 -0500 @@ -20,6 +20,7 @@ obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o obj-$(CONFIG_KEXEC) += kexec.o obj-$(CONFIG_COMPAT) += compat.o +obj-$(CONFIG_PNOTIFY) += pnotify.o obj-$(CONFIG_CPUSETS) += cpuset.o obj-$(CONFIG_IKCONFIG) += configs.o obj-$(CONFIG_IKCONFIG_PROC) += configs.o Index: linux/kernel/fork.c =================================================================== --- linux.orig/kernel/fork.c 2005-09-19 22:00:41.000000000 -0500 +++ linux/kernel/fork.c 2005-09-27 17:46:55.434722156 -0500 @@ -42,6 +42,7 @@ #include <linux/profile.h> #include <linux/rmap.h> #include <linux/acct.h> +#include <linux/pnotify.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> @@ -151,6 +152,9 @@ init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2; init_task.signal->rlim[RLIMIT_SIGPENDING] = init_task.signal->rlim[RLIMIT_NPROC]; + + /* Initialize the pnotify list in pid 0 before it can clone itself. */ + INIT_PNOTIFY_LIST(current); } static struct task_struct *dup_task_struct(struct task_struct *orig) @@ -1039,6 +1043,15 @@ p->exit_state = 0; /* + * Call pnotify kernel module subscribers and add the same subscribers the + * parent has to the new process. + * Fail the fork on error. + */ + retval = pnotify_fork(p, current); + if (retval) + goto bad_fork_cleanup_namespace; + + /* * Ok, make it visible to the rest of the system. * We dont wake it up yet. */ @@ -1160,6 +1173,7 @@ return p; bad_fork_cleanup_namespace: + pnotify_exit(p); exit_namespace(p); bad_fork_cleanup_keys: exit_keys(p); Index: linux/kernel/exit.c =================================================================== --- linux.orig/kernel/exit.c 2005-09-27 10:43:30.570609074 -0500 +++ linux/kernel/exit.c 2005-09-27 17:46:55.433745692 -0500 @@ -29,6 +29,7 @@ #include <linux/proc_fs.h> #include <linux/mempolicy.h> #include <linux/cpuset.h> +#include <linux/pnotify.h> #include <linux/syscalls.h> #include <linux/signal.h> @@ -866,6 +867,9 @@ module_put(tsk->binfmt->module); tsk->exit_code = code; + + pnotify_exit(tsk); + exit_notify(tsk); #ifdef CONFIG_NUMA mpol_free(tsk->mempolicy); Index: linux/kernel/pnotify.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux/kernel/pnotify.c 2005-09-27 17:53:40.296257924 -0500 @@ -0,0 +1,554 @@ +/* + * Process Notification (pnotify) interface + * + * + * Copyright (c) 2000-2005 Silicon Graphics, Inc. All Rights Reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Contact information: Silicon Graphics, Inc., 1500 Crittenden Lane, + * Mountain View, CA 94043, or: + * + * http://www.sgi.com + */ + +#include <linux/config.h> +#include <linux/slab.h> +#include <linux/sched.h> +#include <linux/module.h> +#include <linux/rcupdate.h> +#include <linux/pnotify.h> +#include <asm/semaphore.h> + +/* list of pnotify event list entries that reference the "module" + * implementations */ +static LIST_HEAD(pnotify_event_list); +static DECLARE_RWSEM(pnotify_event_list_sem); + + +/** + * pnotify_get_subscriber - get a pnotify subscriber given a search key + * @task: We examine the pnotify_subscriber_list from the given task + * @key: Key name of kernel module subscriber we wish to retrieve + * + * Given a pnotify_subscriber_list structure, this function will return + * a pointer to the kernel module pnotify_subsciber struct that matches the + * search key. If the key is not found, the function will return NULL. + * + * Locking: This is a pnotify_subscriber_list reader. This function should be + * called within the protection of rcu_read_lock(). + * + */ +struct pnotify_subscriber * +pnotify_get_subscriber(struct task_struct *task, char *key) +{ + struct pnotify_subscriber *subscriber; + + list_for_each_entry_rcu(subscriber, &task->pnotify_subscriber_list, entry) { + if (!strcmp(subscriber->events->name,key)) + return subscriber; + } + return NULL; +} + + +/** + * pnotify_subscribe - Add kernel module to the subscriber list for process + * @task: Task that gets the new kernel module subscriber added to the list + * @events: pnotify_events structure to associate with kernel module + * + * Given a task and a pnotify_events structure, this function will allocate + * a new pnotify_subscriber, initialize the settings, and insert it into + * the pnotify_subscriber_list for the task. + * + * Locking: + * The caller for this function should hold at least a read lock on the + * pnotify_event_list_sem - or ensure that the pnotify_events entry cannot be + * removed. If this function was called from the pnotify module (usually the + * case), then the caller need not hold this lock. + * + * This is a pnotify_subscriber_list WRITER. The caller must hold a write + * lock on for the tasks pnotify_subscriber_list_sem. This can be locked + * using down_write(&task->pnotify_subscriber_list_sem) + * + */ +struct pnotify_subscriber * +pnotify_subscribe(struct task_struct *task, struct pnotify_events *events) +{ + struct pnotify_subscriber *subscriber; + + subscriber = kmalloc(sizeof(struct pnotify_subscriber), GFP_KERNEL); + if (!subscriber) + return NULL; + + subscriber->events = events; + subscriber->data = NULL; + atomic_inc(&events->refcnt); /* Increase events reference count */ + list_add_tail_rcu(&subscriber->entry, &task->pnotify_subscriber_list); + return subscriber; +} + +/** + * pnotify_unsubscribe_rcu - Free up the pnotify_subscriber when RCU is ready + * @rcu - the rcu_head to retrieve the pointer to free from + * + */ +static void +pnotify_unsubscribe_rcu (struct rcu_head *rcu) { + struct pnotify_subscriber *sub = container_of(rcu, + struct pnotify_subscriber, rcu); + + kfree(sub); +} + +/** + * pnotify_unsubscribe - Remove kernel module assocation from process + * @subscriber: The subscriber to remove + * + * This function will ensure the subscriber is deleted form + * the list of subscribers for the task. Finally, the memory for the + * subscriber is discarded. + * + * Prior to calling pnotify_unsubscribe, the subscriber should have been + * detached from any uses the kernel module may have. This is often done using + * p->events->exit(task, subscriber); + * + * Locking: + * This is a pnotify_subscriber_list WRITER. The caller of this function must + * hold a write lock on the pnotify_subscriber_list_sem for the task. This can + * be locked using down_write(&task->pnotify_subscriber_list_sem). + * + */ +void +pnotify_unsubscribe(struct pnotify_subscriber *subscriber) +{ + atomic_dec(&subscriber->events->refcnt); /* decr the ref cnt on events */ + list_del_rcu(&subscriber->entry); + call_rcu(&subscriber->rcu, pnotify_unsubscribe_rcu); +} + + +/** + * pnotify_get_events - Get the pnotify_events struct matching requested name + * @key: The name of the events structure to get + * + * Given a pnotify_events struct name that represents the kernel module name, + * this functil will return a pointer to the pnotify_events structure that + * matches the name. + * + * Locking: + * You should hold either the write or read lock for pnotify_event_list_sem + * before using this function. This will ensure that the pnotify_event_list + * does not change while iterating through the list entries. + * + */ +static struct pnotify_events * +pnotify_get_events(char *key) +{ + struct pnotify_events *events; + + list_for_each_entry(events, &pnotify_event_list, entry) { + if (!strcmp(events->name, key)) { + return events; + } + } + return NULL; +} + +/** + * remove_subscriber_from_all_tasks - Remove subscribers for given events struct + * @events: pnotify_events struct for subscribers to remove + * + * Given a kernel module events struct registered with pnotify, + * this functil will remove all subscribers matching the events struct from + * all tasks. + * + * If there is a exit function associated with the subscriber, it is called + * before the subscriber is unsubscribed/freed. + * + * This is meant to be used by pnotify_register and pnotify_unregister + * + * Locking: This is a WRITER so the pnotify_subscriber_list_sem is locked + * and RCU protections are used within this function. Callers don't need + * to do any special locking here. + * + */ +static void +remove_subscriber_from_all_tasks(struct pnotify_events *events) +{ + if (events == NULL) + return; + + /* Because of internal race conditions we can't gaurantee + * getting every task in just one pass so we just keep going + * until there are no tasks with subscribers from this events struct + * attached. The inefficiency of this should be tempered by the fact that + * this happens at most once for each registered client. + */ + while (atomic_read(&events->refcnt) != 0) { + struct task_struct *g = NULL, *p = NULL; + + read_lock(&tasklist_lock); + do_each_thread(g, p) { + struct pnotify_subscriber *subscriber; + int task_exited; + + get_task_struct(p); + read_unlock(&tasklist_lock); + rcu_read_lock(); + down_write(&p->pnotify_subscriber_list_sem); + subscriber = pnotify_get_subscriber(p, events->name); + if (subscriber != NULL) { + (void)events->exit(p, subscriber); + pnotify_unsubscribe(subscriber); + } + up_write(&p->pnotify_subscriber_list_sem); + rcu_read_unlock(); + read_lock(&tasklist_lock); + + /* If a task exited while we were looping, its sibling list would be + * empty. In that case, we jump out of the do_each_thread and loop + * again in the outter while because the reference count probably + * isn't zero for the pnotify events yet. Doing it this way makes + * it so we don't hold the tasklist lock too long. + */ + task_exited = list_empty(&p->sibling); + put_task_struct(p); + if (task_exited) + goto endloop; + } while_each_thread(g, p); + endloop: + read_unlock(&tasklist_lock); + } +} + +/** + * pnotify_register - Register a new module subscriber and enter it in the list + * @events_new: The new pnotify events structure to register. + * + * Used to register a new module subscriber pnotify_events structure and enter + * it into the pnotify_event_list. The service name for a pnotify_events + * struct is restricted to 32 characters. + * + * If an "init()" function is supplied in the events struct being registered + * then the kernel module will be subscribed to all existing tasks and the + * supplied "init()" function will be applied to it. If any call to the + * supplied "init()" function returns a non zero result, the registration will + * be aborted. As part of the abort process, all subscribers belonging to the + * new client will be removed from all tasks and the supplied "detach()" + * function will be called on them. + * + * If a memory error is encountered, the module (pnotify_events structure) + * is unregistered and any tasks we became subscribed to are detached. + * + * Locking: This function is an event list writer as well as a + * pnotfiy_subscriber_list writer. This function performs the event list + * write locks and the pnotify_subscriber_list write locks and RCU + * protections. Callers don't have to do anyting for locking here. + * + */ +int +pnotify_register(struct pnotify_events *events_new) +{ + struct pnotify_events *events = NULL; + + /* Add new pnotify module to access list */ + if (!events_new) + return -EINVAL; /* error */ + if (!list_empty(&events_new->entry)) + return -EINVAL; /* error */ + if (events_new->name == NULL || strlen(events_new->name) > PNOTIFY_NAMELN) + return -EINVAL; /* error */ + if (!events_new->fork || !events_new->exit) + return -EINVAL; /* error */ + + /* Try to insert new events entry into the events list */ + down_write(&pnotify_event_list_sem); + + events = pnotify_get_events(events_new->name); + + if (events) { + up_write(&pnotify_event_list_sem); + printk(KERN_WARNING "Attempt to register duplicate" + " pnotify support (name=%s)\n", events_new->name); + return -EBUSY; + } + + /* Okay, we can insert into the events list */ + list_add_tail(&events_new->entry, &pnotify_event_list); + /* set the ref count to zero */ + atomic_set(&events_new->refcnt, 0); + + /* Now we can call the initializer function (if present) for each task */ + if (events_new->init != NULL) { + struct task_struct *g = NULL, *p = NULL; + int init_result = 0; + + /* Because of internal race conditions we can't guarantee + * getting every task in just one pass so we just keep going + * until we don't find any unitialized tasks. The inefficiency + * of this should be tempered by the fact that this happens + * at most once for each registered client. + */ + read_lock(&tasklist_lock); + repeat: + do_each_thread(g, p) { + struct pnotify_subscriber *subscriber; + int task_exited; + + get_task_struct(p); + read_unlock(&tasklist_lock); + rcu_read_lock(); + down_write(&p->pnotify_subscriber_list_sem); + subscriber = pnotify_get_subscriber(p, events_new->name); + if (!subscriber && !(p->flags & PF_EXITING)) { + subscriber = pnotify_subscribe(p, events_new); + if (subscriber != NULL) { + init_result = events_new->init(p, subscriber); + + /* Success, but init function pointer doesn't want this funct. + * on the subscriber list. */ + if (init_result > 0) + pnotify_unsubscribe(subscriber); + } + else + init_result = -ENOMEM; + } + up_write(&p->pnotify_subscriber_list_sem); + rcu_read_unlock(); + read_lock(&tasklist_lock); + /* Like in remove_subscriber_from_all_tasks, if the task + * disappeared on us while we were going through the + * for_each_thread loop, we need to start over with that loop. + * That's why we have the list_empty here */ + task_exited = list_empty(&p->sibling); + put_task_struct(p); + if (init_result < 0) + goto endloop; + if (task_exited) + goto repeat; + } while_each_thread(g, p); + endloop: + read_unlock(&tasklist_lock); + + /* + * if anything went wrong during initialization abandon the + * registration process + */ + if (init_result < 0) { + remove_subscriber_from_all_tasks(events_new); + list_del_init(&events_new->entry); + up_write(&pnotify_event_list_sem); + + printk(KERN_WARNING "Registering pnotify support for" + " (name=%s) failed\n", events_new->name); + + return init_result; /* hook init function error result */ + } + } + + up_write(&pnotify_event_list_sem); + + printk(KERN_INFO "Registering pnotify support for (name=%s)\n", + events_new->name); + + return 0; /* success */ + +} + +/** + * pnotify_unregister - Unregister kernel module/pnotify_event struct + * @event_old: pnotify_event struct for the kernel module we're unregistering + * + * Used to unregister kernel module subscribers indicated by + * pnotify_events struct. Removes them from the list of kernel modules + * in pnotify_event_list. + * + * Once the events entry in the pnotify_event_list is found, subscribers for + * this kernel module have their exit functions called and will then be + * removed from the list. + * + * Locking: This function is a pnotify_event_list writer. It also calls + * remove_subscriber_from_all_tasks, which is a pnotify_subscriber_list + * writer. Callers don't have to do any locking ahead of this function. + * + */ +int +pnotify_unregister(struct pnotify_events *events_old) +{ + struct pnotify_events *events; + + /* Check the validity of the arguments */ + if (!events_old) + return -EINVAL; /* error */ + if (list_empty(&events_old->entry)) + return -EINVAL; /* error */ + if (events_old->name == NULL) + return -EINVAL; /* error */ + + down_write(&pnotify_event_list_sem); + + events = pnotify_get_events(events_old->name); + + if (events && events == events_old) { + remove_subscriber_from_all_tasks(events); + list_del_init(&events->entry); + up_write(&pnotify_event_list_sem); + + printk(KERN_INFO "Unregistering pnotify support for" + " (name=%s)\n", events_old->name); + + return 0; /* success */ + } + + up_write(&pnotify_event_list_sem); + + printk(KERN_WARNING "Attempt to unregister pnotify support (name=%s)" + " failed - not found\n", events_old->name); + + return -EINVAL; /* error */ +} + + +/** + * __pnotify_fork - Add kernel module subscriber to same subscribers as parent + * @to_task: The child task that will inherit the parent's subscribers + * @from_task: The parent task + * + * Used to attach a new task to the same subscribers the parent has in its + * subscriber list. + * + * The "from" argument is the parent task. The "to" argument is the child + * task. + * + * See Documentation/pnotify.txt for details on + * how to handle return codes from the attach function pointer. + * + * Locking: The to_task is currently in-construction, so we don't + * need to worry about write-locks. We do need to be sure the parent's + * subscriber list, which we copy here, doesn't go away on us. This is + * done via RCU. + * + */ +int +__pnotify_fork(struct task_struct *to_task, struct task_struct *from_task) +{ + struct pnotify_subscriber *from_subscriber; + int ret; + + /* We need to be sure the parent's list we copy from doesn't disappear */ + rcu_read_lock(); + + list_for_each_entry_rcu(from_subscriber, &from_task->pnotify_subscriber_list, entry) { + struct pnotify_subscriber *to_subscriber = NULL; + + to_subscriber = pnotify_subscribe(to_task, from_subscriber->events); + if (!to_subscriber) { + ret=-ENOMEM; + __pnotify_exit(to_task); + rcu_read_unlock(); + return ret; + } + ret = to_subscriber->events->fork(to_task, to_subscriber, + from_subscriber->data); + + if (ret < 0) { + /* Propagates to copy_process as a fork failure */ + /* No __pnotify_exit because there is one in the failure path + * for copy_process in fork.c */ + rcu_read_unlock(); + return ret; /* Fork failure */ + } + else if (ret > 0) { + /* Success, but fork function pointer in the pnotify_events structure + * doesn't want the kenrel module subscribed */ + /* Again, this is the in-construction-child so no write lock */ + pnotify_unsubscribe(to_subscriber); + } + } + rcu_read_unlock(); /* no more to do with the parent's data */ + + return 0; /* success */ +} + +/** + * __pnotify_exit - Remove all subscribers from given task + * @task: Task to remove subscribers from + * + * Locking: This is a pnotify_subscriber_list writer. This function + * write locks the pnotify_subscriber_list and handles RCU protections. + * Callers don't have to do their own locking. The pnotify_events + * structure referenced exit function is called with rcu_read_lock and + * pnotify_subscriber_list write lock held. + * + * + */ +void +__pnotify_exit(struct task_struct *task) +{ + struct pnotify_subscriber *subscriber; + + rcu_read_lock(); + down_write(&task->pnotify_subscriber_list_sem); + + /* Remove ref. to subscribers from task immediately */ + list_for_each_entry_rcu(subscriber, &task->pnotify_subscriber_list, entry) { + + subscriber->events->exit(task, subscriber); + pnotify_unsubscribe(subscriber); + } + + up_write(&task->pnotify_subscriber_list_sem); + rcu_read_unlock(); + + return; +} + + +/** + * __pnotify_exec - Execute exec callback for each subscriber in this task + * @task: We go through the subscriber list in the given task + * + * Used to when a process that has a subscriber list does an exec. + * + * Locking: This is a pnotify_subscriber_list reader and implements RCU + * protections. Callers don't need to do their own locking. The + * pnotify_events referenced exec function pointer is called in an + * environment with rcu_read_lock but _no_ pnotify_subscriber_list write + * lock in force. + * + */ +int +__pnotify_exec(struct task_struct *task) +{ + struct pnotify_subscriber *subscriber; + + rcu_read_lock(); + + list_for_each_entry_rcu(subscriber, &task->pnotify_subscriber_list, entry) { + if (subscriber->events->exec) /* conditional because it's optional */ + subscriber->events->exec(task, subscriber); + } + + rcu_read_unlock(); + return 0; +} + + +EXPORT_SYMBOL_GPL(pnotify_get_subscriber); +EXPORT_SYMBOL_GPL(pnotify_subscribe); +EXPORT_SYMBOL_GPL(pnotify_unsubscribe); +EXPORT_SYMBOL_GPL(pnotify_register); +EXPORT_SYMBOL_GPL(pnotify_unregister); Index: linux/include/linux/pnotify.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux/include/linux/pnotify.h 2005-09-27 15:29:58.992034033 -0500 @@ -0,0 +1,265 @@ +/* + * Process Notification (pnotify) interface + * + * + * Copyright (c) 2000-2002, 2004-2005 Silicon Graphics, Inc. + * All Rights Reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * + * Contact information: Silicon Graphics, Inc., 1500 Crittenden Lane, + * Mountain View, CA 94043, or: + * + * http://www.sgi.com + * + * For further information regarding this notice, see: + * + * http://oss.sgi.com/projects/GenInfo/NoticeExplan + */ + +/* + * Data structure definitions and function prototypes used to implement + * process notification (pnotify). + * + * pnotify provides a method (service) for kernel modules to be notified when + * certain events happen in the life of a process. It also provides a + * data pointer that is associated with a given process. See + * Documentation/pnotify.txt for a full description. + */ + +#ifndef _LINUX_PNOTIFY_H +#define _LINUX_PNOTIFY_H + +#include <linux/sched.h> +#include <linux/rcupdate.h> + +#ifdef CONFIG_PNOTIFY + +#define PNOTIFY_NAMELN 32 /* Max chars in PNOTIFY kernel module name */ + +#define PNOTIFY_ERROR -1 /* Error. Fork fail for pnotify_fork */ +#define PNOTIFY_OK 0 /* All is well, stay subscribed */ +#define PNOTIFY_NOSUB 1 /* All is well but don't subscribe module + * to subscriber list for the process */ + + +/** + * INIT_PNOTIFY_LIST - init a pnotify subscriber list struct after declaration + * @_l: Task struct to init the pnotify_module_subscriber_list and semaphore + * + */ +#define INIT_PNOTIFY_LIST(_l) \ +do { \ + INIT_LIST_HEAD(&(_l)->pnotify_subscriber_list); \ + init_rwsem(&(_l)->pnotify_subscriber_list_sem); \ +} while(0) + +/* + * Used by task_struct to manage list of subscriber kernel modules for the + * process. Each pnotify_subscriber provides the link between the process + * and the correct kernel module subscriber. + * + * STRUCT MEMBERS: + * pnotify_events: Reference to pnotify_events structure, which + * holds the name key and function pointers. + * data: Opaque data pointer - defined by pnotify kernel modules. + * entry: List pointers + * rcu: rcu head entry + */ +struct pnotify_subscriber { + struct pnotify_events *events; + void *data; + struct list_head entry; + struct rcu_head rcu; +}; + +/* + * Used by pnotify modules to define the callback functions into the + * module. See Documentation/pnotify.txt for details. + * + * STRUCT MEMBERS: + * name: The name of the pnotify container type provided by + * the module. This will be set by the pnotify module. + * fork: Function pointer to function used when associating + * a forked process with a kernel module referenced by + * this struct. pnotify.txt will provide details on + * special return codes interpreted by pnotify. + * + * exit: Function pointer to function used when a process + * associated with the kernel module owning this struct + * exits. + * + * init: Function pointer to initialization function. This + * function is used when the module registers with pnotify + * to associate existing processes with the referring + * kernel module. This is optional and may be set to NULL + * if it is not needed by the pnotify kernel module. + * + * Note: The return values are managed the same way as in + * fork above. Except, of course, an error doesn't + * result in a fork failure. + * + * Note: The implementation of pnotify_register causes + * us to evaluate some tasks more than once in some cases. + * See the comments in pnotify_register for why. + * Therefore, if the init function pointer returns + * PNOTIFY_NOSUB, which means that it doesn't want this + * process associated with the kernel module, that init + * function must be prepared to possibly look at the same + * "skipped" task more than once. + * + * data: Opaque data pointer - defined by pnotify modules. + * module: Pointer to kernel module struct. Used to increment & + * decrement the use count for the module. + * entry: List pointers + * exec: Function pointer to function used when a process + * this kernel module is subscribed to execs. This + * is optional and may be set to NULL if it is not + * needed by the pnotify module. + * refcnt: Keep track of user count of pnotify_events + */ +struct pnotify_events { + struct module *module; + char *name; /* Name Key - restricted to 32 chars */ + void *data; /* Opaque module specific data */ + struct list_head entry; /* List pointers */ + atomic_t refcnt; /* usage counter */ + int (*init)(struct task_struct *, struct pnotify_subscriber *); + int (*fork)(struct task_struct *, struct pnotify_subscriber *, void*); + void (*exit)(struct task_struct *, struct pnotify_subscriber *); + void (*exec)(struct task_struct *, struct pnotify_subscriber *); +}; + + +/* Kernel service functions for providing pnotify support */ +extern struct pnotify_subscriber *pnotify_get_subscriber(struct task_struct + *task, char *key); +extern struct pnotify_subscriber *pnotify_subscribe(struct task_struct *task, + struct pnotify_events *pt); +extern void pnotify_unsubscribe(struct pnotify_subscriber *subscriber); +extern int pnotify_register(struct pnotify_events *pt_new); +extern int pnotify_unregister(struct pnotify_events *pt_old); +extern int __pnotify_fork(struct task_struct *to_task, + struct task_struct *from_task); +extern void __pnotify_exit(struct task_struct *task); +extern int __pnotify_exec(struct task_struct *task); + +/** + * pnotify_fork - child inherits subscriber list associations of its parent + * @child: child task - to inherit + * @parent: parenet task - child inherits subscriber list from this parent + * + * Function used when a child process must inherit subscriber list assocation + * from the parent. Return code is propagated as a fork fail. + * This function does the quick check to see if the subscriber list is empty. + * __pnotify_fork does the rest. If the list is no longer empty by the + * time __pnotify_fork runs, that's OK. + * + * Locking: This is a pnotify_subscriber_list reader, handled by RCU. + * + */ +static inline int pnotify_fork(struct task_struct *child, + struct task_struct *parent) +{ + INIT_PNOTIFY_LIST(child); + rcu_read_lock(); + if (!list_empty(&parent->pnotify_subscriber_list)) { + rcu_read_unlock(); + return __pnotify_fork(child, parent); + } + + rcu_read_unlock(); + return 0; +} + + +/** + * pnotify_exit - Detach subscriber kernel modules from this process + * @task: The task the subscribers will be detached from + * + * This does a quick check to see if the subscriber list is empty. + * If the list becomes empty by the time __pnotify_exit is called, that's + * OK. + * + * Locking: This is a pnotify_subscriber_list reader, handled by RCU. + * Callers don't need to do their own locking. + * + */ +static inline void pnotify_exit(struct task_struct *task) +{ + rcu_read_lock(); + if (!list_empty(&task->pnotify_subscriber_list)) { + rcu_read_unlock(); + __pnotify_exit(task); + } + else { + rcu_read_unlock(); + } +} + +/** + * pnotify_exec - Used when a process exec's + * @task: The process doing the exec + * + * This does a quick check to see if the subscriber list is empty. + * If the list becomes empty by the time __pnotify_exit is called, that's + * OK. + * + * Locking: This is a pnotify_subscriber_list reader, handled by RCU. + * Callers don't need to do their own locking. + * + */ +static inline void pnotify_exec(struct task_struct *task) +{ + rcu_read_lock(); + if (!list_empty(&task->pnotify_subscriber_list)) { + rcu_read_unlock(); + __pnotify_exec(task); + } + else { + rcu_read_unlock(); + } +} + +/** + * INIT_TASK_PNOTIFY - Used in INIT_TASK to set head and sem of subscriber list + * @tsk: The task work with + * + * Marco Used in INIT_TASK to set the head and sem of pnotify_subscriber_list + * If CONFIG_PNOTIFY is off, it is defined as an empty macro below. + * + */ +#define INIT_TASK_PNOTIFY(tsk) \ + .pnotify_subscriber_list = LIST_HEAD_INIT(tsk.pnotify_subscriber_list),\ + .pnotify_subscriber_list_sem = \ + __RWSEM_INITIALIZER(tsk.pnotify_subscriber_list_sem), + +#else /* CONFIG_PNOTIFY */ + +/* + * Replacement macros used when pnotify (Process Notification) support is not + * compiled into the kernel. + */ +#define INIT_TASK_PNOTIFY(tsk) +#define INIT_PNOTIFY_LIST(l) do { } while(0) +#define pnotify_fork(ct, pt) ({ 0; }) +#define pnotify_exit(t) do { } while(0) +#define pnotify_exec(t) do { } while(0) +#define pnotify_unsubscribe(t) do { } while(0) + +#endif /* CONFIG_PNOTIFY */ + +#endif /* _LINUX_NOTIFY_H */ Index: linux/Documentation/pnotify.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux/Documentation/pnotify.txt 2005-09-27 15:41:40.974232322 -0500 @@ -0,0 +1,369 @@ +Process Notification (pnotify) +-------------------- +pnotify provides a method (service) for kernel modules to be notified when +certain events happen in the life of a process. Events we support include +fork, exit, and exec. A special init event is also supported (see events +below). More events could be added. pnotify also provides a generic data +pointer for the modules to work with so that data can be associated per +process. + +A kernel module will register (pnotify_register) a service request describing +events it cares about (pnotify_events) with pnotify_register. The request +tells pnotify which notifications the kernel module wants. The kernel module +passes along function pointers to be called for these events (exit, fork, exec) +in the pnotify_events service request. + +From the process point of view, each process has a kernel module subscriber +list (pnotify_subscriber_list). These kernel modules are the ones who want +notification about the life of the process. As described above, each kernel +module subscriber on the list has a generic data pointer to point to data +associated with the process. + +In the case of fork, pnotify will allocate the same kernel module subscriber +list for the new child that existed for the parent. The kernel module's +function pointer for fork is also called for the child being constructed so +the kernel module can do what ever it needs to do when a parent forks this +child. Special return values apply for the fork and init event that don't to +others. They are described in the fork and init example below. + +For exit, similar things happen but the exit function pointer for each +kernel module subscriber is called and the kernel module subscriber entry for +that process is deleted. + + +Events +------ +Events are stages of a processes life that kernel modules care about. The +fork event is triggered in a certain location in copy_process when a parent +forks. The exit event happens when a process is going away. We also support +an exec event, which happens when a process execs. Finally, there is an init +event. This special event makes it so this kernel module will be associated +with all current processes in the system at the time of registration. This is +used when a kernel module wants to keep track of all current processes as +opposed to just those it associates by itself (and children that follow). The +events a kernel module cares about are set up in the pnotify_events +structure - see usage below. + +When setting up a pnotify_events, you designate which events you care about +by either associating NULL (meaning you don't care about that event) or a +pointer to the function to run when the event is triggered. The fork event +and the exit event is currently required. + + +How do processes become associated with kernel modules? +------------------------------------------------------- +Your kernel module itself can use the pnotify_subscribe function to associate +a given process with a given pnotify_events structure. This adds +your kernel module to the subscriber list of the process. In the case +of inescapable job containers making use of PAM, when PAM allows a person to +log in, PAM contacts job (via a PAM job module which uses the job userland +library) and the kernel Job code will call pnotify_subscribe to associate the +process with pnotify. From that point on, the kernel module will be notified +about events in the process's life that the module cares about (as well, +as any children that process may later have). + +Likewise, your kernel module can remove an association between it and +a given process by using pnotify_unsubscribe. + + +Example Usage +------------- + +=== filling out the pnotify_events structure === + +A kernel module wishing to use pnotify needs to set up a pnotify_events +structure. This structure tells pnotify which events you care about and what +functions to call when those events are triggered. In addition, you supply a +name (usually the kernel module name). The entry is always filled out as +shown below. .module is usually set to THIS_MODULE. data can be optionally +used to store a pointer with the pnotify_events structure. + +Example of a filled out pnotify_events: + +static struct pnotify_events pnotify_events = { + .module = THIS_MODULE, + .name = "test_module", + .data = NULL, + .entry = LIST_HEAD_INIT(pnotify_events.entry), + .init = test_init, + .fork = test_attach, + .exit = test_detach, + .exec = test_exec, +}; + +The above pnotify_events structure says the kernel module "test_module" cares +about events fork, exit, exec, and init. In fork, call the kernel module's +test_attach function. In exec, call test_exec. In exit, call test_detach. +The init event is specified, so all processes on the system will be associated +with this kernel module during registration and the test_init function will +be run for each. + + +=== Registering with pnotify === + +You will likely register with pnotify in your kernel module's module_init +function. Here is an example: + +static int __init test_module_init(void) +{ + int rc = pnotify_register(&pnotify_events); + if (rc < 0) { + return -1; + } + + return 0; +} + + +=== Example init event function ==== + +Since the init event is defined, it means this kernel module is added +to the subscriber list of all processes -- it will receive notification +about events it cares about for all processes and all children that +follow. + +Of course, if a kernel module doesn't need to know about all current +processes, that module shouldn't implement this and '.init' in the +pnotify_events structure would be NULL. + +This is as opposed to the normal method where the kernel module adds itself +to the subscriber list of a process using pnotify_subscribe. + +Important: +Note: The implementation of pnotify_register causes us to evaluate some tasks +more than once in some cases. See the comments in pnotify_register for why. +Therefore, if the init function pointer returns PNOTIFY_NOSUB, which means +that it doesn't want a process association, that init function must be +prepared to possibly look at the same "skipped" task more than once. + +Note that the return value here is similar to the fork function pointer +below except there is no notion of failing the fork since existing processes +aren't forking. + +PNOTIFY_OK - good, adds the kernel module to the subscriber list for process +PNOTIFY_NOSUB - good, but don't add kernel module to subscriber list for process + +static int test_init(struct task_struct *tsk, struct pnotify_subscriber *subscriber) +{ + if (pnotify_get_subscriber(tsk, "test_module") == NULL) + dprintk("ERROR pnotify expected \"%s\" PID = %d\n", "test_module", tsk->pid); + + dprintk("FYI pnotify init hook fired for PID = %d\n", tsk->pid); + atomic_inc(&init_count); + return 0; +} + + +=== Example fork (test_attach) function === + +This function is executed when a process forks - this is associated +with the pnotify_callout callout in copy_process. There would be a very +similar test_detach function (not shown). + +pnotify will add the kernel module to the notification list for the child +process automatically and then execute this fork function pointer (test_attach +in this example). However, the kernel module can control whether the kernel +module stays on the process's subscriber list and wants notification by the +return value. + +PNOTIFY_ERROR - prevent the process from continuing - failing the fork +PNOTIFY_OK - good, adds the kernel module to the subscriber list for process +PNOTIFY_NOSUB - good, but don't add kernel module to subscriber list for process + + +static int test_attach(struct task_struct *tsk, struct pnotify_subscriber *subscriber, void *vp) +{ + dprintk("pnotify attach hook fired for PID = %d\n", tsk->pid); + atomic_inc(&attach_count); + + return PNOTIFY_OK; +} + + +=== Example exec event function === + +And here is an example function to run when a task gets to exec. So any +time a "tracked" process gets to exec, this would execute. + +static void test_exec(struct task_struct *tsk, struct pnotify_subscriber *subscriber) +{ + dprintk("pnotify exec hook fired for PID %d\n", tsk->pid); + atomic_inc(&exec_count); +} + + +=== Unregistering with pnotify === + +You will likely wish to unregister with pnotify in the kernel module's +module_exit function. Here is an example: + +static void __exit test_module_cleanup(void) +{ + pnotify_unregister(&pnotify_events); + printk("detach called %d times...\n", atomic_read(&detach_count)); + printk("attach called %d times...\n", atomic_read(&attach_count)); + printk("init called %d times...\n", atomic_read(&init_count)); + printk("exec called %d times ...\n", atomic_read(&exec_count)); + if (atomic_read(&attach_count) + atomic_read(&init_count) != + atomic_read(&detach_count)) + printk("pnotify PROBLEM: attach count + init count SHOULD equal detach cound and doesn't\n"); + else + printk("Good - attach count + init count equals detach count.\n"); +} + + + +=== Actually using data associated with the process in your module === + +The above examples show you how to create an example kernel module using +pnotify, but they didn't show what you might do with the data pointer +associated with a given process. Below, find an example of accessing +the data pointer for a given process from within a kernel module making use +of pnotify. + +pnotify_get_subscriber is used to retrieve the pnotify subscriber for a given +process and kernel module. Like this: + +subscriber = pnotify_get_subscriber(task, name); + +Where name is your kernel module's name (as provided in the pnotify_events +structure) and task is the process you're interested in. + +Please be careful about locking. pnotify makes use of RCU and uses a rwsem +for write locking to ensure we don't have multiple writers. The +pnotify_subscriber_list_sem, part of a task, is used for the write lock. This +example retrieves the widigitId variable that is associated with the task +It retrieves the task in a way that ensures it doesn't disappear while we try +to access it (that's why we do locking for the tasklist_lock and task). The +rcu protection ensures pnotify_subscriber_list data doesn't disappear on us +and the rcu_read_lock/unlock calls are required by the pnotify_get_subscriber +function. + + read_lock(&tasklist_lock); + get_task_struct(task); /* Ensure the task doesn't vanish on us */ + read_unlock(&tasklist_lock); /* Unlock the tasklist */ + rcu_read_lock(); + + subscriber = pnotify_get_subscriber(task, name); + if (subscriber) { + /* Get the widgitId associated with this task */ + widgitId = ((widgitId_t *)subscriber->data); + } + put_task_struct(task); /* Done accessing the task */ + rcu_read_unlock(); + + +Future Events +------------- +More events could be added as they are needed. One idea is notification of +guid and uid changes. + +History +------- +Process Notification used to be known as PAGG (Process Aggregates). +It was re-written to be called Process Notification because we believe this +better describes its purpose. Structures and functions were re-named to +be more clear and to reflect the new name. + + +Why Not Notifier Lists? +----------------------- +We investigated the use of notifier lists, available in newer kernels. + +Notifier lists would not be as efficient as pnotify for kernel modules +wishing to associate data with processes. With pnotify, if the +pnotify_subscriber_list of a given task is NULL, we can instantly know +there are no kernel modules that care about the process. Further, the +callbacks happen in places were the task struct is likely to be cached. +So this is a quick operation. With notifier lists, the scope is system +wide rather than per process. As long as one kernel module wants to be +notified, we have to walk the notifier list and potentially waste cycles. +In the case of pnotify, we only walk lists if we're interested about +a specific task. + +On a system where pnotify is used to track only a few processes, the +overhead of walking the notifier list is high compared to the overhead +of walking the kernel module subscriber list only when a kernel module +is interested in a given process. + +I don't believe this is easily solved in notifier lists themselves as +they are meant to be global resources, not per-task resources. + +Overlooking performance issues, notifier lists in and of themselves wouldn't +solve the problem pnotify solves anyway. Although you could argue notifier +lists can implement the callback portion of pnotify, there is no association +of data with a given process. This is a needed for kernel modules to +efficiently associate a task with a data pointer without cluttering up +the task struct. + +In addition to data associated with a process, we desire the ability for +kernel modules to add themselves to the subscriber list for any arbitrary +process - not just current or a child of current. + + +Some Justification +------------------ +We feel that pnotify could be used to reduce the size of the task struct or +the number of functions in copy_process. For example, if another part of the +kernel needs to know when a process is forking or exiting, they could use +pnotify instead of adding additional code to task struct, copy_process, or +exit. + +Some have argued that PAGG in the past shouldn't be used because it will +allow interesting things to be implemented outside of the kernel. While this +might be a small risk, having these in place allows customers and users to +implement kernel components that you don't want to see in the kernel anyway. + +For example, a certain vendor may have an urgent need to implement kernel +functionality or special types of accounting that nobody else is interested +in. That doesn't mean the code isn't open-source, it just means it isn't +applicable to all of Linux because it satisfies a niche. + +All of pnotify's functionality that needs to be exported is exported with +EXPORT_SYMBOL_GPL to discourage abuse. + +The risk already exists in the kernel for people to implement modules outside +the kernel that suffer from less peer review and possibly bad programming +practice. pnotify could add more oppurtunities for out-of-tree kernel module +authors to make new modules. I believe this is somewhat mitigated by the +already-existing 'tainted' warnings in the kernel. + +Other Ideas? +------------ +There have been similar proposals to provide pieces of the pnotify +functionality. If there is a better proposal out there, let's explore it. +Here are some key functions I hope to see in any proposal: + + - Ability to have notification for exec, fork, exit at minimum + - Ability to extend to other callouts later (such as uid/gid changes as + I described earlier) + - Ability for pnotify user modules to implement code that ends up adding + a kernel module subscriber to any arbitrary process (not just current and + its children). + +I believe, if the above are more or less met, we should be in good shape for +our other open source projects such as linux job. + +Variable Name Changes from PAGG to pnotify +------------------------------------------ +PAGG_NAMELEN -> PNOTIFY_NAMELEN +struct pagg -> pnotify_subscriber +pagg_get -> pnotify_get_subscriber +pagg_alloc -> pnotify_subscribe +pagg_free -> pnotify_unsubscribe +pagg_hook_register -> pnotify_register +pagg_hook_unregister -> pnotify_unregister +pagg_attach -> pnotify_fork +pagg_detach -> pnotify_exit +pagg_exec -> pnotify_exec +struct pagg_hook -> pnotify_events + +With pnotify_events (formerly pagg_hook): + attach -> fork + detach -> exit + +Return codes for the init and fork function pointers should use: +PNOTIFY_ERROR - prevent the process from continuing - failing the fork +PNOTIFY_OK - good, adds the kernel module to the subscriber list for process +PNOTIFY_NOSUB - good, but don't add kernel module to subscriber list for process + -- Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota |
From: Dipankar S. <dip...@in...> - 2005-09-29 17:16:04
|
On Thu, Sep 29, 2005 at 11:53:28AM -0500, Erik Jacobson wrote: > My feeling is we shouldn't use this RCU-protected-subscriber-list version > of pnotify - see my performance post to follow. It includes notes about > sleeping while rcu_read_lock is held, for exmaple. On a quick look this patch does look bogus to me. > + > +/** > + * pnotify_unsubscribe_rcu - Free up the pnotify_subscriber when RCU is ready > + * @rcu - the rcu_head to retrieve the pointer to free from > + * > + */ > +static void > +pnotify_unsubscribe_rcu (struct rcu_head *rcu) { > + struct pnotify_subscriber *sub = container_of(rcu, > + struct pnotify_subscriber, rcu); > + > + kfree(sub); > +} > + > +/** > + * > + */ > +void > +pnotify_unsubscribe(struct pnotify_subscriber *subscriber) > +{ > + atomic_dec(&subscriber->events->refcnt); /* decr the ref cnt on events */ > + list_del_rcu(&subscriber->entry); > + call_rcu(&subscriber->rcu, pnotify_unsubscribe_rcu); > +} Could you use a per-subscriber reference count ? That will allow you to drop rcu_read_lock() safely. > +static void > +remove_subscriber_from_all_tasks(struct pnotify_events *events) > +{ > + if (events == NULL) > + return; > + > + /* Because of internal race conditions we can't gaurantee > + * getting every task in just one pass so we just keep going > + * until there are no tasks with subscribers from this events struct > + * attached. The inefficiency of this should be tempered by the fact that > + * this happens at most once for each registered client. > + */ > + while (atomic_read(&events->refcnt) != 0) { > + struct task_struct *g = NULL, *p = NULL; > + > + read_lock(&tasklist_lock); > + do_each_thread(g, p) { > + struct pnotify_subscriber *subscriber; > + int task_exited; > + > + get_task_struct(p); > + read_unlock(&tasklist_lock); > + rcu_read_lock(); > + down_write(&p->pnotify_subscriber_list_sem); Wrong. Will refcounting suscrbiber itself here be costly ? Thanks Dipankar |
From: Erik J. <er...@sg...> - 2005-09-29 18:10:04
|
> Could you use a per-subscriber reference count ? That will allow > you to drop rcu_read_lock() safely. In the performance tests, I showed that I really couldn't measure the speed difference in AIM7 and fork bomb tests using the original rwsems and a kernel without pnotify and job. The process for the tests and any children processes all had two subscribers (keyring and job) at the time. I'm just not convinced that RCU is the right fit for pnotify in general since I don't see a speed gain. If my proof of concept rcu pnotify patch is so bad that it can't even be used to gage some performance numbers, I'm happy to try new things. However, it seems to me that most methods for fixing the rcu pnotify patch would decrease efficiency rather than increase it. As was pointed out to me in a discussion in the pagg mailing list, we could be in a situation where we normally have as many writers as readers for many situations. I'm not sure the rule of thumb for writers vs readers points to a good match for RCU. I'm not saying I'm opposed to trying things that you suggest with RCU if you think it's worth the effort. I was just pointing out my thoughts on the matter and welcoming input. Erik |
From: Dipankar S. <dip...@in...> - 2005-09-29 19:17:49
|
On Thu, Sep 29, 2005 at 01:09:16PM -0500, Erik Jacobson wrote: > > Could you use a per-subscriber reference count ? That will allow > > you to drop rcu_read_lock() safely. > > In the performance tests, I showed that I really couldn't measure > the speed difference in AIM7 and fork bomb tests using the original > rwsems and a kernel without pnotify and job. The process for the tests and > any children processes all had two subscribers (keyring and job) at the > time. I'm just not convinced that RCU is the right fit for pnotify in > general since I don't see a speed gain. > > If my proof of concept rcu pnotify patch is so bad that it can't even be > used to gage some performance numbers, I'm happy to try new things. > However, it seems to me that most methods for fixing the rcu pnotify patch > would decrease efficiency rather than increase it. > > As was pointed out to me in a discussion in the pagg mailing list, we > could be in a situation where we normally have as many writers as > readers for many situations. I'm not sure the rule of thumb for writers vs > readers points to a good match for RCU. Oh, I am only pointing out RCU problems. It does make sense to do some benchmarking and see if it has benefits over rwsem or not. I would like to see the comparison on one of those SGI behemoths instead of a 2-cpu box :) Thanks Dipankar |
From: Erik J. <er...@sg...> - 2005-09-29 19:26:37
|
> Oh, I am only pointing out RCU problems. It does make sense to do > some benchmarking and see if it has benefits over rwsem or not. > I would like to see the comparison on one of those SGI behemoths > instead of a 2-cpu box :) I can run the same tests on a bigger box, sure. I guess the host name in the AIM output isn't even that exciting for you -- minime1 :) I'll get some time on a larger system and get back to you. Thanks! -- Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota |
From: Erik J. <er...@sg...> - 2005-09-29 16:56:26
|
This is a version of Job modified for use with the RCU version of pnotify. This patch allocates memory and checks rwsems's while rcu_read_lock is active and is probably illegal in that sense. See my post on performance data to follow shortly for a discussion on that. Documentation/job.txt | 104 ++ include/linux/job_acct.h | 124 +++ include/linux/jobctl.h | 185 ++++ init/Kconfig | 29 kernel/Makefile | 1 kernel/fork.c | 1 kernel/job.c | 1892 +++++++++++++++++++++++++++++++++++++++++++++++ 7 files changed, 2336 insertions(+) Index: linux/Documentation/job.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux/Documentation/job.txt 2005-09-27 17:53:47.957595073 -0500 @@ -0,0 +1,104 @@ +Linux Jobs - A Process Notification (pnotify) Module +---------------------------------------------------- + +1. Overview + +This document provides two additional sections. Section 2 provides a +listing of the manual page that describes the particulars of the Linux +job implementation. Section 3 provides some information about using +the user job library to interface to jobs. + +2. Job Man Page + + +JOB(7) Linux User's Manual JOB(7) + + +NAME + job - Linux Jobs kernel module overview + +DESCRIPTION + A job is a group of related processes all descended from a + point of entry process and identified by a unique job + identifier (jid). A job can contain multiple process + groups or sessions, and all processes in one of these sub- + groups can only be contained within a single job. + + The primary purpose for having jobs is to provide job + based resource limits. The current implementation only + provides the job container and resource limits will be + provided in a later implementation. When an implementa- + tion that provides job limits is available, this descrip- + tion will be expanded to provide further explanation of + job based limits. + + Not every process on the system is part of a job. That + is, only processes which are started by a login initiator + like login, rlogin, rsh and so on, get assigned a job ID. + In the Linux environment, jobs are created via a PAM mod- + ule. + + Jobs on Linux are provided using a loadable kernel module. + Linux jobs have the following characteristics: + + o A job is an inescapable container. A process cannot + leave the job nor can a new process be created outside + the job without explicit action, that is, a system + call with root privilege. + + o Each new process inherits the jid and limits [when + implemented] from its parent process. + + o All point of entry processes (job initiators) create a + new job and set the job limits [when implemented] + appropriately. + + o Job initiation on Linux is performed via a PAM session + module. + + o The job initiator performs authentication and security + checks. + + o Users can raise and lower their own job limits within + maximum values specified by the system administrator + [when implemented]. + + o Not all processes on a system need be members of a job. + + o The process control initialization process (init(1M)) + and startup scripts called by init are not part of a + job. + + + Job initiators can be categorized as either interactive or + batch processes. Limit domain names are defined by the + system administrator when the user limits database (ULDB) + is created. [The ULDB will be implemented in conjunction + with future job limits work.] + + Note: The existing command jobs(1) applies to shell "jobs" + and it is not related to the Linux Kernel Module jobs. + The at(1), atd(8), atq(1), batch(1), atrun(8), atrm(1)) + man pages refer to shell scripts as a job. a shell + script. + +SEE ALSO + job(1), jwait(1), jstat(1), jkill(1) + + + + + + + + + +3. User Job Library + +For developers who wish to make software using Linux Jobs, there exists +a user job library. This library contains functions for obtaining information +about running jobs, creating jobs, detaching, etc. + +The library is part of the job package and can be obtained from oss.sgi.com +using anonymous ftp. Look in the /projects/pagg/download directory. See the +README in the job source package for more information. Index: linux/init/Kconfig =================================================================== --- linux.orig/init/Kconfig 2005-09-27 17:46:52.034674237 -0500 +++ linux/init/Kconfig 2005-09-27 17:53:47.961500929 -0500 @@ -170,6 +170,35 @@ Linux Jobs module and the Linux Array Sessions module. If you will not be using such modules, say N. +config JOB + tristate " Process Notification (pnotify) based jobs" + depends on PNOTIFY + help + The Job feature implements a type of process aggregate, + or grouping. A job is the collection of all processes that + are descended from a point-of-entry process. Examples of such + points-of-entry include telnet, rlogin, and console logins. + + Batch schedulers such as LSF also make use of Job for containing, + maintaining, and signaling a job as one entity. + + A job differs from a session and process group since the job + container (or group) is inescapable. Only root level processes, + or those with the CAP_SYS_RESOURCE capability, can create new jobs + or escape from a job. + + A job is identified by a unique job identifier (jid). Currently, + that jid can be used to obtain status information about the job + and the processes it contians. The jid can also be used to send + signals to all processes contained in the job. In addition, + other processes can wait for the completion of a job - the event + where the last process contained in the job has exited. + + If you want to compile support for jobs into the kernel, select + this entry using Y. If you want the support for jobs provided as + a module, select this entry using M. If you do not want support + for jobs, select N. + config SYSCTL bool "Sysctl support" ---help--- Index: linux/kernel/job.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux/kernel/job.c 2005-09-28 09:56:01.603487672 -0500 @@ -0,0 +1,1892 @@ +/* + * Linux Job kernel module + * + * + * Copyright (c) 2000-2005 Silicon Graphics, Inc. All Rights Reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * + * Contact information: Silicon Graphics, Inc., 1500 Crittenden Lane, + * Mountain View, CA 94043, or: + * + * http://www.sgi.com + * + * For further information regarding this notice, see: + * + * http://oss.sgi.com/projects/GenInfo/NoticeExplan + */ + +/* + * Description: This file implements a type of process grouping called jobs. + * For further information about jobs, consult the file + * Documentation/job.txt. Jobs are implemented using Process Notification + * (pnotify). For more information about pnotify, see + * Documentation/pnotify.txt. + */ + +/* + * LOCKING INFO + * + * There are currently two levels of locking in this module. So, we + * have two classes of locks: + * + * (1) job table lock (always, job_table_sem) + * (2) job entry lock (usually, job->sem) + * + * Most of the locking used is read/write sempahores. In rare cases, a + * spinlock is also used. Those cases requiring a spinlock concern when the + * tasklist_lock must be locked (such as when looping over all tasks on the + * system). + * + * There is only one job_table_sem. There is a job->sem for each job + * entry in the job_table. This job module uses Process Notification + * (pnotify). Each task has a special lock that protects its pnotify + * information - this is called the pnotify_subscriber_list lock. There are + * special macros used to lock/unlock a task's subscriber list lock. The + * subscriber list lock is really a semaphore. + * + * Purpose: + * + * (1) The job_table_sem protects all entries in the table. + * (2) The job->sem protects all data and task attachments for the job. + * + * Truths we hold to be self-evident: + * + * Only the holder of a write lock for the job_table_lock may add or + * delete a job entry from the job_table. The job_table includes all job + * entries in the hash table and chains off the hash table locations. + * + * Only the holder of a write lock for a job->lock may attach or detach + * processes/tasks from the attached list for the job. + * + * If you hold a read lock of job_table_lock, you can assume that the + * job entries in the table will not change. The link pointers for + * the chains of job entries will not change, the job ID (jid) value + * will not change, and data changes will be (mostly) atomic. + * + * If you hold a read lock of a job->lock, you can assume that the + * attachments to the job will not change. The link pointers for the + * attachment list will not change and the attachments will not change. + * + * pnotify uses RCU protections instead of read locks. If you require + * write access to the tasks's pnotify_subscriber_list, you need to + * down_write(&task->pnotify_subscriber_list_sem). Anywhere reading + * involving the pnotify_subscriber_list is done needs to be protected + * by rcu_read_lock / rcu_read_unlock. If iterators are used on the + * pnotify_subscriber_list, use the rcu aware versions. + * + * If you are going to grab nested locks, the nesting order is: + * + * down_write/up_write(&task->pnotify_subscriber_list_sem) + * job_table_sem + * job->sem + * + * However, it is not strictly necessary to down the job_table_sem + * before downing job->sem. + * + * Also, the nesting order allows you to lock in this order: + * + * down_write/up_write(&task->pnotify_subscriber_list_sem) + * job->sem + * + * without locking job_table_sem between the two. + * + */ + +/* standard for kernel modules */ +#include <linux/config.h> +#include <linux/module.h> +#include <linux/kernel.h> +#include <linux/kmod.h> +#include <linux/init.h> +#include <linux/list.h> + +#include <asm/uaccess.h> /* for get_user & put_user */ + +#include <linux/sched.h> /* for current */ +#include <linux/tty.h> /* for the tty declarations */ +#include <linux/slab.h> +#include <linux/types.h> + +#include <linux/proc_fs.h> + +#include <linux/string.h> +#include <asm/semaphore.h> +#include <linux/moduleparam.h> + +#include <linux/pnotify.h> /* to use process notification service */ +#include <linux/jobctl.h> +#include <linux/job_acct.h> + +MODULE_AUTHOR("Silicon Graphics, Inc."); +MODULE_DESCRIPTION("pnotify based inescapable jobs"); +MODULE_LICENSE("GPL"); + +#define HASH_SIZE 1024 + +/* The states for a job */ +#define RUNNING 1 /* Running job */ +#define ZOMBIE 2 /* Dead job */ + +/* Job creation tags for the job HID (host ID) */ +#define DISABLED 0xffffffff /* New job creation disabled */ +#define LOCAL 0x0 /* Only creating local sys jobs */ + + +#ifdef __BIG_ENDIAN +#define iptr_hid(ll) ((u32 *)&(ll)) +#define iptr_sid(ll) (((u32 *)(&(ll) + 1)) - 1) +#else /* __LITTLE_ENDIAN */ +#define iptr_hid(ll) (((u32 *)(&(ll) + 1)) - 1) +#define iptr_sid(ll) ((u32 *)&(ll)) +#endif /* __BIG_ENDIAN */ + +#define jid_hash(ll) (*(iptr_sid(ll)) % HASH_SIZE) + + +/* Job info entry for member tasks */ +struct job_attach { + struct task_struct *task; /* task we are attaching to job */ + struct pnotify_subscriber *subscriber; /* our subscriber entry in task */ + struct job_entry *job; /* the job we are attaching task to */ + struct list_head entry; /* list stuff */ +}; + +struct job_waitinfo { + int status; /* For tasks waiting on job exit */ +}; + +struct job_csainfo { + u64 corehimem; /* Accounting - highpoint, phys mem */ + u64 virthimem; /* Accounting - highpoint, virt mem */ + struct file *acctfile; /* The accounting file for job */ +}; + +/* Job table entry type */ +struct job_entry { + u64 jid; /* Our job ID */ + int refcnt; /* Number of tasks attached to job */ + int state; /* State of job - RUNNING,... */ + struct rw_semaphore sem; /* lock for the job */ + uid_t user; /* user that owns the job */ + time_t start; /* When the job began */ + struct job_csainfo csa; /* CSA accounting info */ + wait_queue_head_t zombie; /* queue last task - during wait */ + wait_queue_head_t wait; /* queue of tasks waiting on job */ + int waitcnt; /* Number of tasks waiting on job */ + struct job_waitinfo waitinfo; /* Status info for waiting tasks */ + struct list_head attached; /* List of attached tasks */ + struct list_head entry; /* List of other jobs - same hash */ +}; + + +/* Job container tables */ +static struct list_head job_table[HASH_SIZE]; +static int job_table_refcnt = 0; +static DECLARE_RWSEM(job_table_sem); + + +/* Accounting subscriber list */ +static struct job_acctmod *acct_list[JOB_ACCT_COUNT]; +static DECLARE_RWSEM(acct_list_sem); + + +/* Host ID for the localhost */ +static u32 jid_hid; + +static char *hid = NULL; +module_param(hid, charp, 0); + +/* Function prototypes */ +static int job_dispatch_create(struct job_create *); +static int job_dispatch_getjid(struct job_getjid *); +static int job_dispatch_waitjid(struct job_waitjid *); +static int job_dispatch_killjid(struct job_killjid *); +static int job_dispatch_getjidcnt(struct job_jidcnt *); +static int job_dispatch_getjidlst(struct job_jidlst *); +static int job_dispatch_getpidcnt(struct job_pidcnt *); +static int job_dispatch_getpidlst(struct job_pidlst *); +static int job_dispatch_getuser(struct job_user *); +static int job_dispatch_getprimepid(struct job_primepid *); +static int job_dispatch_sethid(struct job_sethid *); +static int job_dispatch_detachjid(struct job_detachjid *); +static int job_dispatch_detachpid(struct job_detachpid *); +static int job_dispatch_attachpid(struct job_attachpid *); +static int job_attach(struct task_struct *, struct pnotify_subscriber *, void *); +static void job_detach(struct task_struct *, struct pnotify_subscriber *); +static struct job_entry *job_getjob(u64 jid); +static int job_dispatcher(unsigned int, unsigned long); + +u64 job_getjid(struct task_struct *); + +int job_ioctl(struct inode *, struct file *, unsigned int, unsigned long); + +/* Job container pnotify service request */ +static struct pnotify_events events = { + .module = THIS_MODULE, + .name = PNOTIFY_JOB, + .data = &job_table, + .entry = LIST_HEAD_INIT(events.entry), + .fork = job_attach, + .exit = job_detach, +}; + +/* proc dir entry */ +struct proc_dir_entry *job_proc_entry; + +/* file operations for proc file */ +static struct file_operations job_file_ops = { + .owner = THIS_MODULE, + .ioctl = job_ioctl +}; + + +/* + * job_getjob - return job_entry given a jid + * @jid: The jid of the job entry we wish to retrieve + * + * Given a jid value, find the entry in the job_table and return a pointer + * to the job entry or NULL if not found. + * + * You should normally down_read the job_table_sem before calling this + * function. + */ +struct job_entry * +job_getjob(u64 jid) +{ + struct list_head *entry = NULL; + struct job_entry *tjob = NULL; + struct job_entry *job = NULL; + + list_for_each(entry, &job_table[ jid_hash(jid) ]) { + tjob = list_entry(entry, struct job_entry, entry); + if (tjob->jid == jid) { + job = tjob; + break; + } + } + return job; +} + + +/* + * job_attach - Attach a task to a specified job + * @task: Task we want to attach to the job + * @new_subscriber: The already allocated subscriber struct for the task + * @old_data: (struct job_attach *)old_data)->job is the specified job + * + * Attach the task to the job specified in the target data (old_data). + * This function will add the task to the list of attached tasks for the job. + * In addition, a link from the task to the job is created and added to the + * task via the data pointer reference. + * + * The process that owns the target data should be at least read locked (using + * down_read(&task->pnotify_subscriber_list_sem)) during this call. This help + * in ensuring that the job cannot be removed since at least one process will + * still be referencing the job (the one owning the target_data). + * + * It is expected that this function will be called from within the + * pnotify_fork() function in the kernel, when forking (do_fork) a child + * process represented by task. + * + * If this function is called form some other point, then it is possible that + * task and data could be altered while going through this function. In such + * a case, the caller should also lock the pnotify_subscriber_list for the task + * task_struct. + * + * the function returns 0 upon success, and -1 upon failure. + */ +static int +job_attach(struct task_struct *task, struct pnotify_subscriber *new_subscriber, + void *old_data) +{ + struct job_entry *job = ((struct job_attach *)old_data)->job; + struct job_attach *attached = NULL; + int errcode = 0; + + /* + * Lock the job for writing. The task owning target_data has its + * pnotify_subscriber_list_sem locked, so we know there is at least one + * active reference to the job - therefore, it cannot have been removed + * before we have gotten this write lock established. + */ + down_write(&job->sem); + + if (job->state == ZOMBIE) { + /* If the job is a zombie (dying), bail out of the attach */ + printk(KERN_WARNING "Attach task(pid=%d) to job" + " failed - job is ZOMBIE\n", + task->pid); + errcode = -EINPROGRESS; + up_write(&job->sem); + goto error_return; + } + + + /* Allocate memory that we will need */ + + attached = (struct job_attach *)kmalloc(sizeof(struct job_attach), + GFP_KERNEL); + if (!attached) { + /* error */ + printk(KERN_ERR "Attach task(pid=%d) to job" + " failed on memory error in kernel\n", + task->pid); + errcode = -ENOMEM; + up_write(&job->sem); + goto error_return; + } + + + attached->task = task; + attached->subscriber = new_subscriber; + attached->job = job; + new_subscriber->data = (void *)attached; + list_add_tail(&attached->entry, &job->attached); + ++job->refcnt; + + up_write(&job->sem); + + return 0; + +error_return: + kfree(attached); + return errcode; +} + + +/* + * job_detach - Detach a task via the pnotify subscriber reference + * @task: The task to be detached + * @subscriber: The pnotify subscriber reference + * + * Detach the task from the job attached to via the pnotify reference. + * This function will remove the task from the list of attached tasks for the + * job specified via the pnotify data pointer. In addition, the link to the + * job provided via the data pointer will also be removed. + * + * The pnotify_subscriber_list should be write locked for task before enterin + * this function (using down_write(&task->pnotify_subscriber_list_sem)). + * + * the function returns 0 uopn success, and -1 uopn failure. + */ +static void +job_detach(struct task_struct *task, struct pnotify_subscriber *subscriber) +{ + struct job_attach *attached = ((struct job_attach *)(subscriber->data)); + struct job_entry *job = attached->job; + struct job_csa csa; + struct job_acctmod *acct; + + /* + * Obtain the lock on the the job_table_sem and the job->sem for + * this job. + */ + down_write(&job_table_sem); + down_write(&job->sem); + + /* - CSA accounting */ + if (acct_list[JOB_ACCT_CSA]) { + acct = acct_list[JOB_ACCT_CSA]; + if (acct->module) { + if (try_module_get(acct->module) == 0) { + printk(KERN_WARNING + "job_detach: Tried to get non-living acct module\n"); + } + } + if (acct->eop) { + csa.job_id = job->jid; + csa.job_uid = job->user; + csa.job_start = job->start; + csa.job_corehimem = job->csa.corehimem; + csa.job_virthimem = job->csa.virthimem; + csa.job_acctfile = job->csa.acctfile; + acct->eop(task->exit_code, task, &csa); + } + if (acct->module) + module_put(acct->module); + } + job->refcnt--; + list_del(&attached->entry); + subscriber->data = NULL; + kfree(attached); + + if (job->refcnt == 0) { + int waitcnt; + + list_del(&job->entry); + --job_table_refcnt; + + /* + * The job is removed from the job_table. + * We can remove the job_table_sem now since + * nobody can access the job via the table. + */ + up_write(&job_table_sem); + + job->state = ZOMBIE; + job->waitinfo.status = task->exit_code; + + waitcnt = job->waitcnt; + + /* + * Release the job semaphore. You cannot hold + * this lock if you want the wakeup to work + * properly. + */ + up_write(&job->sem); + + if (waitcnt > 0) { + wake_up_interruptible(&job->wait); + wait_event(job->zombie, job->waitcnt == 0); + } + + /* + * Job is exiting, all processes waiting for job to exit + * have been notified. Now we call the accountin + * subscribers. + */ + + /* - CSA accounting */ + if (acct_list[JOB_ACCT_CSA]) { + acct = acct_list[JOB_ACCT_CSA]; + if (acct->module) { + if (try_module_get(acct->module) == 0) { + printk(KERN_WARNING + "job_detach: Tried to get non-living acct module\n"); + } + } + if (acct->jobend) { + int res = 0; + + csa.job_id = job->jid; + csa.job_uid = job->user; + csa.job_start = job->start; + csa.job_corehimem = job->csa.corehimem; + csa.job_virthimem = job->csa.virthimem; + csa.job_acctfile = job->csa.acctfile; + + res = acct->jobend(JOB_EVENT_END, + &csa); + if (res) { + printk(KERN_WARNING + "job_detach: CSA -" + " jobend failed.\n"); + } + } + if (acct->module) + module_put(acct->module); + } + /* + * Every process attached or waiting on this job should be + * detached and finished waiting, so now we can free the + * memory for the job. + */ + kfree(job); + + } else { + /* This is case where job->refcnt was greater than 1, so + * we were not going to delete the job after the detach. + * Therefore, only the job->sem is being held - the + * job_table_sem was released earlier. + */ + up_write(&job->sem); + up_write(&job_table_sem); + } + + return; +} + +/* + * job_dispatch_create - create a new job and attach the calling process to it + * @create_args: Pointer of job_create struct which stores the create request + * + * Returns 0 on success, and negative on failure (negative errno value). + */ +static int +job_dispatch_create(struct job_create *create_args) +{ + struct job_create create; + struct job_entry *job = NULL; + struct job_attach *attached = NULL; + struct pnotify_subscriber *subscriber = NULL; + struct pnotify_subscriber *old_subscriber = NULL; + int errcode = 0; + struct job_acctmod *acct = NULL; + static u32 jid_count = 0; + u32 initial_jid_count; + + /* + * if the job ID - host ID segment is set to DISABLED, we will + * not be creating new jobs. We don't mark it as an error, but + * the jid value returned will be 0. + */ + if (jid_hid == DISABLED) { + errcode = 0; + goto error_return; + } + if (!capable(CAP_SYS_RESOURCE)) { + errcode = -EPERM; + goto error_return; + } + if (!create_args) { + errcode = -EINVAL; + goto error_return; + } + + if (copy_from_user(&create, create_args, sizeof(create))) { + errcode = -EFAULT; + goto error_return; + } + + /* + * Allocate some of the memory we might need, before we start + * locking + */ + + attached = (struct job_attach *)kmalloc(sizeof(struct job_attach), GFP_KERNEL); + if (!attached) { + /* error */ + errcode = -ENOMEM; + goto error_return; + } + + job = (struct job_entry *)kmalloc(sizeof(struct job_entry), GFP_KERNEL); + if (!job) { + /* error */ + errcode = -ENOMEM; + goto error_return; + } + + /* We keep the old pnotify subscriber reference around in case we need it + * in an error condition. If, for example, a job_getjob call fails because + * the requested JID is already in use, we don't want to detach that job. + * Having this ability is complicated by the locking. + */ + rcu_read_lock(); + down_write(¤t->pnotify_subscriber_list_sem); + old_subscriber = pnotify_get_subscriber(current, events.name); + + /* + * Lock the job_table and add the pointers for the new job. + * Since the job is new, we won't need to lock the job. + */ + down_write(&job_table_sem); + + /* + * Determine if create should use specified JID or one that is + * generated. + */ + if (create.jid != 0) { + /* We use the specified JID value */ + job->jid = create.jid; + /* Does the supplied JID conflict with an existing one? */ + if (job_getjob(job->jid)) { + /* JID already in use, bail. error_return tosses/frees job */ + + /* error_return doesn't do up_write() */ + up_write(&job_table_sem); + /* we haven't allocated a new pnotify subscriber refernce yet so + * error_return won't unlock this. We'll unlock here */ + up_write(¤t->pnotify_subscriber_list_sem); + rcu_read_unlock(); + errcode = -EBUSY; + /* error_return doesn't touch old_subscriber so we don't detach */ + goto error_return; + } + } else { + /* We generate a new JID value using a new JID */ + *(iptr_hid(job->jid)) = jid_hid; + *(iptr_sid(job->jid)) = jid_count; + initial_jid_count = jid_count++; + while (((job->jid == 0) || (job_getjob(job->jid))) && + jid_count != initial_jid_count) { + + /* JID was in use or was zero, try a new one */ + *(iptr_sid(job->jid)) = jid_count++; + } + /* If all the JIDs are in use, fail */ + if (jid_count == initial_jid_count) { + /* error_return tosses/frees job */ + /* error_return doesn't do up_write() */ + up_write(&job_table_sem); + /* we haven't allocated a new pagg yet so error_return won't unlock + * this. We'll unlock here */ + up_write(¤t->pnotify_subscriber_list_sem); + rcu_read_unlock(); + errcode = -EBUSY; + /* error_return doesn't touch old_pagg so we don't detach */ + goto error_return; + } + + } + + subscriber = pnotify_subscribe(current, &events); + if (!subscriber) { + /* error */ + up_write(&job_table_sem); /* unlock since error_return doesn't */ + up_write(¤t->pnotify_subscriber_list_sem); + rcu_read_unlock(); + errcode = -ENOMEM; + goto error_return; + } + + /* Initialize job entry values & lists */ + job->refcnt = 1; + job->user = create.user; + job->start = jiffies; + job->csa.corehimem = 0; + job->csa.virthimem = 0; + job->csa.acctfile = NULL; + job->state = RUNNING; + init_rwsem(&job->sem); + INIT_LIST_HEAD(&job->attached); + list_add_tail(&attached->entry, &job->attached); + init_waitqueue_head(&job->wait); + init_waitqueue_head(&job->zombie); + job->waitcnt = 0; + job->waitinfo.status = 0; + + /* set link from entry in attached list to task and job entry */ + attached->task = current; + attached->job = job; + attached->subscriber = subscriber; + subscriber->data = (void *)attached; + + /* Insert new job into front of chain list */ + list_add_tail(&job->entry, &job_table[ jid_hash(job->jid) ]);; + ++job_table_refcnt; + + up_write(&job_table_sem); + /* At this point, the possible error conditions where we would need the + * old pnotify subscriber are gone. So we can remove it. We remove after + * we unlock because the detach function does job table lock of its own. + */ + if (old_subscriber) { + /* + * Detaching subscribers for jobs never has a failure case, + * so we don't need to worry about error codes. + */ + old_subscriber->events->exit(current, old_subscriber); + pnotify_unsubscribe(old_subscriber); + } + up_write(¤t->pnotify_subscriber_list_sem); + rcu_read_unlock(); + + /* Issue callbacks into accounting subscribers */ + + /* - CSA subscriber */ + if (acct_list[JOB_ACCT_CSA]) { + acct = acct_list[JOB_ACCT_CSA]; + if (acct->module) { + if (try_module_get(acct->module) == 0) { + printk(KERN_WARNING + "job_dispatch_create: Tried to get non-living acct module\n"); + } + } + if (acct->jobstart) { + int res; + struct job_csa csa; + + csa.job_id = job->jid; + csa.job_uid = job->user; + csa.job_start = job->start; + csa.job_corehimem = job->csa.corehimem; + csa.job_virthimem = job->csa.virthimem; + csa.job_acctfile = job->csa.acctfile; + + res = acct->jobstart(JOB_EVENT_START, &csa); + if (res < 0) { + printk(KERN_WARNING "job_dispatch_create: CSA -" + " jobstart failed.\n"); + } + } + if (acct->module) + module_put(acct->module); + } + + + create.r_jid = job->jid; + if (copy_to_user(create_args, &create, sizeof(create))) { + return -EFAULT; + } + + return 0; + +error_return: + kfree(attached); + kfree(job); + create.r_jid = 0; + if (copy_to_user(create_args, &create, sizeof(create))) + return -EFAULT; + + return errcode; +} + + +/* + * job_dispatch_getjid - retrieves the job ID (jid) for the specified process (pid) + * @getjid_args: Pointer of job_getjid struct which stores the get request + * + * returns 0 on success, negative errno value on exit. + */ +static int +job_dispatch_getjid(struct job_getjid *getjid_args) +{ + struct job_getjid getjid; + int errcode = 0; + struct task_struct *task; + + if (copy_from_user(&getjid, getjid_args, sizeof(getjid))) + return -EFAULT; + + /* lock the tasklist until we grab the specific task */ + read_lock(&tasklist_lock); + + if (getjid.pid == current->pid) { + task = current; + } else { + task = find_task_by_pid(getjid.pid); + } + if (task) { + get_task_struct(task); /* Ensure the task doesn't vanish on us */ + read_unlock(&tasklist_lock); /* unlock the task list */ + getjid.r_jid = job_getjid(task); + put_task_struct(task); /* We're done accessing the task */ + if (getjid.r_jid == 0) { + errcode = -ENODATA; + } + } else { + read_unlock(&tasklist_lock); + getjid.r_jid = 0; + errcode = -ESRCH; + } + + + if (copy_to_user(getjid_args, &getjid, sizeof(getjid))) + return -EFAULT; + return errcode; +} + + +/* + * job_dispatch_waitjid - allows a process to wait until a job exits + * @waitjid_args: Pointer of job_waitjid struct which stores the wait request + * + * On success returns 0, failure it returns the negative errno value. + */ +static int +job_dispatch_waitjid(struct job_waitjid *waitjid_args) +{ + struct job_waitjid waitjid; + struct job_entry *job; + int retcode = 0; + + if (copy_from_user(&waitjid, waitjid_args, sizeof(waitjid))) + return -EFAULT; + + waitjid.r_jid = waitjid.stat = 0; + + if (waitjid.options != 0) { + retcode = -EINVAL; + goto general_return; + } + + /* Lock the job table so that the current jobs don't change */ + down_read(&job_table_sem); + + if ((job = job_getjob(waitjid.jid)) == NULL ) { + up_read(&job_table_sem); + retcode = -ENODATA; + goto general_return; + } + + /* + * We got the job we need, we can release the job_table_sem + */ + down_write(&job->sem); + up_read(&job_table_sem); + + ++job->waitcnt; + + up_write(&job->sem); + + /* We shouldn't hold any locks at this point! The increment of the + * jobs waitcnt will ensure that the job is not removed without + * first notifying this current task */ + retcode = wait_event_interruptible(job->wait, + job->refcnt == 0); + + if (!retcode) { + /* + * This data is static at this point, we will + * not need a lock to read it. + */ + waitjid.stat = job->waitinfo.status; + waitjid.r_jid = job->jid; + } + + down_write(&job->sem); + --job->waitcnt; + + if (job->waitcnt == 0) { + up_write(&job->sem); + + /* + * We shouldn't hold any locks at this point! Else, the + * last process in the job will not be able to remove the + * job entry. + * + * That process is stuck waiting for this wake_up, so the + * job shouldn't disappear until after this function call. + * The job entry is not longer in the job table, so no + * other process can get to the entry to foul things up. + */ + wake_up(&job->zombie); + } else { + up_write(&job->sem); + } + +general_return: + if (copy_to_user(waitjid_args, &waitjid, sizeof(waitjid))) + return -EFAULT; + return retcode; +} + + +/* + * job_dispatch_killjid - send a signal to all processes in a job + * @killjid_args: Pointer of job_killjid struct which stores the kill request + * + * returns 0 on success, negative of errno on failure. + */ +static int +job_dispatch_killjid(struct job_killjid *killjid_args) +{ + struct job_killjid killjid; + struct job_entry *job; + struct list_head *attached_entry; + struct siginfo info; + int retcode = 0; + + if (copy_from_user(&killjid, killjid_args, sizeof(killjid))) { + retcode = -EFAULT; + goto cleanup_0locks_return; + } + + killjid.r_val = -1; + + /* A signal of zero is really a status check and is handled as such + * by send_sig_info. So we have < 0 instead of <= 0 here. + */ + if (killjid.sig < 0) { + retcode = -EINVAL; + goto cleanup_0locks_return; + } + + down_read(&job_table_sem); + job = job_getjob(killjid.jid); + if (!job) { + /* Job not found, copy back data & bail with error */ + retcode = -ENODATA; + goto cleanup_1locks_return; + } + + down_read(&job->sem); + + /* + * Check capability to signal job. The signaling user must be + * the owner of the job or have CAP_SYS_RESOURCE capability. + */ + if (!capable(CAP_SYS_RESOURCE)) { + if (current->uid != job->user) { + retcode = -EPERM; + goto cleanup_2locks_return; + } + } + + info.si_signo = killjid.sig; + info.si_errno = 0; + info.si_code = SI_USER; + info.si_pid = current->pid; + info.si_uid = current->uid; + + /* send_group_sig_info needs the tasklist lock locked */ + read_lock(&tasklist_lock); + list_for_each(attached_entry, &job->attached) { + int err; + struct job_attach *attached; + + attached = list_entry(attached_entry, struct job_attach, entry); + err = send_group_sig_info(killjid.sig, &info, + attached->task); + if (err != 0) { + /* + * XXX - the "prime" process, or initiating process + * for the job may not be owned by the user. So, + * we would get an error in this case. However, we + * ignore the error for that specific process - it + * should exit when all the child processes exit. It + * should ignore all signals from the user. + * + */ + if (attached->entry.prev != &job->attached) { + retcode = err; + } + } + + } + read_unlock(&tasklist_lock); + +cleanup_2locks_return: + up_read(&job->sem); +cleanup_1locks_return: + up_read(&job_table_sem); +cleanup_0locks_return: + killjid.r_val = retcode; + + if (copy_to_user(killjid_args, &killjid, sizeof(killjid))) + return -EFAULT; + return retcode; +} + + +/* + * job_dispatch_getjidcnt - return the number of jobs currently on the system + * @jidcnt_args: Pointer of job_jidcnt struct which stores the get request + * + * returns 0 on success & it always succeeds. + */ +static int +job_dispatch_getjidcnt(struct job_jidcnt *jidcnt_args) +{ + struct job_jidcnt jidcnt; + + /* read lock might be overdoing it in this case */ + down_read(&job_table_sem); + jidcnt.r_val = job_table_refcnt; + up_read(&job_table_sem); + + if (copy_to_user(jidcnt_args, &jidcnt, sizeof(jidcnt))) + return -EFAULT; + + return 0; +} + + +/* + * job_dispatch_getjidlst - get the list of all jids currently on the system + * @jidlist_args: Pointer of job_jidlst struct which stores the get request + */ +static int +job_dispatch_getjidlst(struct job_jidlst *jidlst_args) +{ + struct job_jidlst jidlst; + u64 *jid; + struct job_entry *job; + struct list_head *job_entry; + int i; + int count; + + if (copy_from_user(&jidlst, jidlst_args, sizeof(jidlst))) + return -EFAULT; + + if (jidlst.r_val == 0) + return 0; + + jid = (u64 *)kmalloc(sizeof(u64)*jidlst.r_val, GFP_KERNEL); + if (!jid) { + jidlst.r_val = 0; + if (copy_to_user(jidlst_args, &jidlst, sizeof(jidlst))) + return -EFAULT; + return -ENOMEM; + } + + count = 0; + down_read(&job_table_sem); + for (i = 0; i < HASH_SIZE && count < jidlst.r_val; i++) { + list_for_each(job_entry, &job_table[i]) { + job = list_entry(job_entry, struct job_entry, entry); + jid[count++] = job->jid; + if (count == jidlst.r_val) { + break; + } + } + } + up_read(&job_table_sem); + + jidlst.r_val = count; + + for (i = 0; i < count; i++) + if (copy_to_user(jidlst.jid+i, &jid[i], sizeof(u64))) + return -EFAULT; + + kfree(jid); + + if (copy_to_user(jidlst_args, &jidlst, sizeof(jidlst))) + return -EFAULT; + return 0; +} + + +/* + * job_dispatch_getpidcnt - get the processe count in the specified job + * @pidcnt_args: Pointer of job_pidcnt struct which stores the get request + * + * returns 0 on success, or negative errno value on failure. + */ +static int +job_dispatch_getpidcnt(struct job_pidcnt *pidcnt_args) +{ + struct job_pidcnt pidcnt; + struct job_entry *job; + int retcode = 0; + + if (copy_from_user(&pidcnt, pidcnt_args, sizeof(pidcnt))) + return -EFAULT; + + pidcnt.r_val = 0; + + down_read(&job_table_sem); + job = job_getjob(pidcnt.jid); + if (!job) { + retcode = -ENODATA; + } else { + /* Read lock might be overdoing it for this case */ + down_read(&job->sem); + pidcnt.r_val = job->refcnt; + up_read(&job->sem); + } + up_read(&job_table_sem); + + if (copy_to_user(pidcnt_args, &pidcnt, sizeof(pidcnt))) + return -EFAULT; + return retcode; +} + +/* + * job_getpidlst - get the process list in the specified job + * @pidlst_args: Pointer of job_pidlst struct which stores the get request + * + * This function returns the the list of processes that are part of the job. + * The number of processes provided by this function could be trimmed if + * max size specified in r_val is not large enough to hold the entire list. + * + * returns 0 on success, negative errno value on failure. + */ +static int +job_dispatch_getpidlst(struct job_pidlst *pidlst_args) +{ + struct job_pidlst pidlst; + struct job_entry *job; + struct job_attach *attached; + struct list_head *attached_entry; + pid_t *pid; + int max; + int i; + + if (copy_from_user(&pidlst, pidlst_args, sizeof(pidlst))) + return -EFAULT; + + if (pidlst.r_val == 0) + return 0; + + max = pidlst.r_val; + pidlst.r_val = 0; + pid = (pid_t *)kmalloc(sizeof(pid_t)*max, GFP_KERNEL); + if (!pid) { + if (copy_to_user(pidlst_args, &pidlst, sizeof(pidlst))) + return -EFAULT; + return -ENOMEM; + } + + down_read(&job_table_sem); + + job = job_getjob(pidlst.jid); + if (!job) { + up_read(&job_table_sem); + if (copy_to_user(pidlst_args, &pidlst, sizeof(pidlst))) + return -EFAULT; + return -ENODATA; + } else { + + down_read(&job->sem); + up_read(&job_table_sem); + + i = 0; + list_for_each(attached_entry, &job->attached) { + if (i == max) { + break; + } + attached = list_entry(attached_entry, struct job_attach, + entry); + pid[i++] = attached->task->pid; + } + pidlst.r_val = i; + + up_read(&job->sem); + } + + for (i = 0; i < pidlst.r_val; i++) + if (copy_to_user(pidlst.pid+i, &pid[i], sizeof(pid_t))) + return -EFAULT; + kfree(pid); + + copy_to_user(pidlst_args, &pidlst, sizeof(pidlst)); + return 0; +} + + +/* + * job_dispatch_getuser - get the uid of the user that owns the job + * @user_args: Pointer of job_user struct which stores the get request + * + * returns 0 on success, returns negative errno on failure. + */ +static int +job_dispatch_getuser(struct job_user *user_args) +{ + struct job_entry *job; + struct job_user user; + int retcode = 0; + + if (copy_from_user(&user, user_args, sizeof(user))) + return(-EFAULT); + user.r_user = 0; + + down_read(&job_table_sem); + + job = job_getjob(user.jid); + if (!job) { + retcode = -ENODATA; + } else { + down_read(&job->sem); + user.r_user = job->user; + up_read(&job->sem); + } + + up_read(&job_table_sem); + + if (copy_to_user(user_args, &user, sizeof(user))) + return -EFAULT; + return retcode; +} + + +/* + * job_dispatch_getprimepid - get the oldest process (primepid) in the job + * @primepid_args: Pointer of job_primepid struct which stores the get request + * + * returns 0 on success, negative errno on failure. + */ +static int +job_dispatch_getprimepid(struct job_primepid *primepid_args) +{ + struct job_primepid primepid; + struct job_entry *job = NULL; + struct job_attach *attached = NULL; + int retcode = 0; + + if (copy_from_user(&primepid, primepid_args, sizeof(primepid))) + return -EFAULT; + + primepid.r_pid = 0; + + down_read(&job_table_sem); + + job = job_getjob(primepid.jid); + if (!job) { + up_read(&job_table_sem); + /* Job not found, return INVALID VALUE */ + return -ENODATA; + } + + /* + * Job found, now look at first pid entry in the + * attached list. + */ + down_read(&job->sem); + up_read(&job_table_sem); + if (list_empty(&job->attached)) { + retcode = -ESRCH; + primepid.r_pid = 0; + } else { + attached = list_entry(job->attached.next, struct job_attach, entry); + if (!attached->task) { + retcode = -ESRCH; + } else { + primepid.r_pid = attached->task->pid; + } + } + up_read(&job->sem); + + if (copy_to_user(primepid_args, &primepid, sizeof(primepid))) + return -EFAULT; + return retcode; +} + + +/* + * job_dispatch_sethid - set the host ID segment for the job IDs (jid) + * @sethid_args: Pointer of job_sethid struct which stores the set request + * + * If hid does not get set, then the jids upper 32 bits will be set to + * 0 and the jid cannot be used reliably in a cluster environment. + * + * returns -errno value on fail, 0 on success + */ +static int +job_dispatch_sethid(struct job_sethid *sethid_args) +{ + struct job_sethid sethid; + int errcode = 0; + + if (copy_from_user(&sethid, sethid_args, sizeof(sethid))) + return -EFAULT; + + if (!capable(CAP_SYS_RESOURCE)) { + errcode = -EPERM; + sethid.r_hid = 0; + goto cleanup_return; + } + + /* + * Set job_table_sem, so no jobs can be deleted while doing + * this operation. + */ + down_write(&job_table_sem); + + sethid.r_hid = jid_hid = sethid.hid; + + up_write(&job_table_sem); + +cleanup_return: + if (copy_to_user(sethid_args, &sethid, sizeof(sethid))) + return -EFAULT; + return errcode; +} + + +/* + * job_dispatch_detachjid - detach all processes attached to the specified job + * @detachjid_args: Pointer of job_detachjid struct + * + * The job will exit after the detach. The processes are allowed to + * continue running. You need CAP_SYS_RESOURCE capability for this to succeed. + * + * returns -errno value on fail, 0 on success + */ +static int +job_dispatch_detachjid(struct job_detachjid *detachjid_args) +{ + struct job_detachjid detachjid; + struct job_entry *job; + struct list_head *entry; + int count; + int errcode = 0; + struct task_struct *task; + struct pnotify_subscriber *subscriber; + + if (copy_from_user(&detachjid, detachjid_args, sizeof(detachjid))) + return -EFAULT; + + detachjid.r_val = 0; + + if (!capable(CAP_SYS_RESOURCE)) { + errcode = -EPERM; + goto cleanup_return; + } + + /* + * Set job_table_sem, so no jobs can be deleted while doing + * this operation. + */ + down_write(&job_table_sem); + + job = job_getjob(detachjid.jid); + + if (job) { + + down_write(&job->sem); + + /* Mark job as ZOMBIE so no new processes can attach to it */ + job->state = ZOMBIE; + + count = job->refcnt; + + /* Okay, no new processes can attach to the job. We can + * release the locks on the job_table and job since the only + * way for the job to change now is for tasks to detach and + * the job to be removed. And this is what we want to happen + */ + up_write(&job_table_sem); + up_write(&job->sem); + + + /* Walk through list of attached tasks and unset the + * pnotify subscriber entries. + * + * We don't test with list_empty because that actually means NO tasks + * left rather than one task. If we used !list_empty or list_for_each, + * we could reference memory freed by the pnotify hook detach function + * (job_detach). + * + * We know there is only one task left when job->attached.next and + * job->attached.prev both point to the same place. + */ + while (job->attached.next != job->attached.prev) { + entry = job->attached.next; + + task = (list_entry(entry, struct job_attach, entry))->task; + subscriber = (list_entry(entry, struct job_attach, entry))->subscriber; + + rcu_read_lock(); + down_write(&task->pnotify_subscriber_list_sem); + subscriber->events->exit(task, subscriber); + pnotify_unsubscribe(subscriber); + up_write(&task->pnotify_subscriber_list_sem); + rcu_read_unlock(); + + } + /* At this point, there is only one task left */ + + entry = job->attached.next; + + task = (list_entry(entry, struct job_attach, entry))->task; + subscriber = (list_entry(entry, struct job_attach, entry))->subscriber; + + rcu_read_lock(); + down_write(&task->pnotify_subscriber_list_sem); + subscriber->events->exit(task, subscriber); + pnotify_unsubscribe(subscriber); + up_write(&task->pnotify_subscriber_list_sem); + rcu_read_unlock(); + + detachjid.r_val = count; + + } else { + errcode = -ENODATA; + up_write(&job_table_sem); + } + +cleanup_return: + if (copy_to_user(detachjid_args, &detachjid, sizeof(detachjid))) + return -EFAULT; + return errcode; +} + + +/* + * job_dispatch_detachpid - detach a process from the job it is attached to + * @detachpid_args: Pointer of job_detachpid struct. + * + * That process is allowed to continue running. You need + * CAP_SYS_RESOURCE capability for this to succeed. + * + * returns -errno value on fail, 0 on success + */ +static int +job_dispatch_detachpid(struct job_detachpid *detachpid_args) +{ + struct job_detachpid detachpid; + struct task_struct *task; + struct pnotify_subscriber *subscriber; + int errcode = 0; + + if (copy_from_user(&detachpid, detachpid_args, sizeof(detachpid))) + return -EFAULT; + + detachpid.r_jid = 0; + + if (!capable(CAP_SYS_RESOURCE)) { + errcode = -EPERM; + goto cleanup_return; + } + + /* Lock the task list while we find a specific task */ + read_lock(&tasklist_lock); + task = find_task_by_pid(detachpid.pid); + if (!task) { + errcode = -ESRCH; + /* We need to unlock the tasklist here too or the lock is held forever */ + read_unlock(&tasklist_lock); + goto cleanup_return; + } + + /* We have a valid task now */ + get_task_struct(task); /* Ensure the task doesn't vanish on us */ + read_unlock(&tasklist_lock); /* Unlock the tasklist */ + rcu_read_lock(); + down_write(&task->pnotify_subscriber_list_sem); + + subscriber = pnotify_get_subscriber(task, events.name); + if (subscriber) { + detachpid.r_jid = ((struct job_attach *)subscriber->data)->job->jid; + subscriber->events->exit(task, subscriber); + pnotify_unsubscribe(subscriber); + } else { + errcode = -ENODATA; + } + put_task_struct(task); /* Done accessing the task */ + up_write(&task->pnotify_subscriber_list_sem); + rcu_read_unlock(); + +cleanup_return: + if (copy_to_user(detachpid_args, &detachpid, sizeof(detachpid))) + return -EFAULT; + return errcode; +} + +/* + * job_dispatch_attachpid - attach a process to the specified job + * @attachpid_args: Pointer of job_attachpid struct. + * + * The attaching process must not belong to any job and the specified job + * must exist. You need CAP_SYS_RESOURCE capability for this to succeed. + * + * returns -errno value on fail, 0 on success + */ +static int +job_dispatch_attachpid(struct job_attachpid *attachpid_args) +{ + struct job_attachpid attachpid; + struct task_struct *task; + struct pnotify_subscriber *subscriber; + struct job_entry *job = NULL; + struct job_attach *attached = NULL; + int errcode = 0; + + if (copy_from_user(&attachpid, attachpid_args, sizeof(attachpid))) + return -EFAULT; + + if (!capable(CAP_SYS_RESOURCE)) { + errcode = -EPERM; + goto cleanup_return; + } + + /* lock the tasklist until we grab the specific task */ + read_lock(&tasklist_lock); + task = find_task_by_pid(attachpid.pid); + if (!task) { + errcode = -ESRCH; + /* We need to unlock the tasklist here too or the lock is held f +orever */ + read_unlock(&tasklist_lock); + goto cleanup_return; + } + + /* We have a valid task now */ + get_task_struct(task); /* Ensure the task doesn't vanish on us */ + read_unlock(&tasklist_lock); /* Unlock the tasklist */ + + rcu_read_lock(); + down_write(&task->pnotify_subscriber_list_sem); + /* check if it belongs to a job*/ + subscriber = pnotify_get_subscriber(task, events.name); + if (subscriber) { + put_task_struct(task); + up_write(&task->pnotify_subscriber_list_sem); + rcu_read_unlock(); + errcode = -EINVAL; + goto cleanup_return; + } + + /* Alloc subscriber list entry for it */ + subscriber = pnotify_subscribe(task, &events); + if (subscriber) { + down_read(&job_table_sem); + /* Check on the requested job */ + job = job_getjob(attachpid.r_jid); + if (!job) { + pnotify_unsubscribe(subscriber); + errcode = -ENODATA; + } + else { + attached = list_entry(job->attached.next, struct job_attach, entry); + if(attached) { + if (subscriber->events->fork(task, subscriber, attached) != 0) { + pnotify_unsubscribe(subscriber); + errcode = -EFAULT; + } + } + } + up_read(&job_table_sem); + } else + errcode = -ENOMEM; + put_task_struct(task); /* Done accessing the task */ + up_write(&task->pnotify_subscriber_list_sem); + rcu_read_unlock(); + +cleanup_return: + if (copy_to_user(attachpid_args, &attachpid, sizeof(attachpid))) + return -EFAULT; + return errcode; +} + + +/* + * job_register_acct - accounting modules register to job module + * @am: The registering accounting module's job_acctmod pointer + * + * returns -errno value on fail, 0 on success. + */ +int +job_register_acct(struct job_acctmod *am) +{ + if (!am) + return -EINVAL; /* error, invalid value */ + if (am->type < 0 || am->type > (JOB_ACCT_COUNT-1)) + return -EINVAL; /* error, invalid value */ + + down_write(&acct_list_sem); + if (acct_list[am->type] != NULL) { + up_write(&acct_list_sem); + return -EBUSY; /* error, duplicate entry */ + } + + acct_list[am->type] = am; + up_write(&acct_list_sem); + return 0; +} + + +/* + * job_unregister_acct - accounting modules to unregister with the job module + * @am: The unregistering accounting module's job_acctmod pointer + * + * Returns -errno on failure and 0 on success. + */ +int +job_unregister_acct(struct job_acctmod *am) +{ + if (!am) + return -EINVAL; /* error, invalid value */ + if (am->type < 0 || am->type > (JOB_ACCT_COUNT-1)) + return -EINVAL; /* error, invalid value */ + + down_write(&acct_list_sem); + + if (acct_list[am->type] != am) { + up_write(&acct_list_sem); + return -EFAULT; /* error, not matching entry */ + } + + acct_list[am->type] = NULL; + up_write(&acct_list_sem); + return 0; +} + +/* + * job_getjid - return the Job ID for the given task. + * @task: The given task + * + * If the task is not attached to a job, then 0 is returned. + * + */ +u64 job_getjid(struct task_struct *task) +{ + struct pnotify_subscriber *subscriber = NULL; + struct job_entry *job = NULL; + u64 jid = 0; + + rcu_read_lock(); + subscriber = pnotify_get_subscriber(task, events.name); + if (subscriber) { + job = ((struct job_attach *)subscriber->data)->job; + down_read(&job->sem); + jid = job->jid; + up_read(&job->sem); + } + rcu_read_unlock(); + + return jid; +} + + +/* + * job_getacct - accounting subscribers get accounting information about a job. + * @jid: the job id + * @type: the accounting subscriber type + * @data: the accounting data that subscriber wants. + * + * The caller must supply the Job ID (jid) that specifies the job. The + * "type" argument indicates the type of accounting data to be returned. + * The data will be returned in the memory accessed via the data pointer + * argument. The data pointer is void so that this function interface + * can handle different types of accounting data. + * + */ +int job_getacct(u64 jid, int type, void *data) +{ + struct job_entry *job; + + if (!data) + return -EINVAL; + + if (!jid) + return -EINVAL; + + down_read(&job_table_sem); + job = job_getjob(jid); + if (!job) { + up_read(&job_table_sem); + return -ENODATA; + } + + down_read(&job->sem); + up_read(&job_table_sem); + + switch (type) { + case JOB_ACCT_CSA: + { + struct job_csa *csa = (struct job_csa *)data; + + csa->job_id = job->jid; + csa->job_uid = job->user; + csa->job_start = job->start; + csa->job_corehimem = job->csa.corehimem; + csa->job_virthimem = job->csa.virthimem; + csa->job_acctfile = job->csa.acctfile; + break; + } + default: + up_read(&job->sem); + return -EINVAL; + break; + } + up_read(&job->sem); + return 0; +} + +/* + * job_setacct - accounting subscribers set accounting info in the job + * @jid: the job id + * @type: the accounting subscriber type. + * @subfield: the accounting information subfield for this set call + * @data: the accounting information to be set + * + * The job is identified by the jid argument. The type indicates the + * type of accounting the information is associated with. The subfield + * is a bitmask that indicates exactly what subfields are to be changed. + * The data that is used to set the values is supplied by the data pointer. + * The data pointer is a void type so that the interface can be used for + * different types of accounting information. + */ +int job_setacct(u64 jid, int type, int subfield, void *data) +{ + struct job_entry *job; + + if (!data) + return -EINVAL; + + if (!jid) + return -EINVAL; + + down_read(&job_table_sem); + job = job_getjob(jid); + if (!job) { + up_read(&job_table_sem); + return -ENODATA; + } + + down_read(&job->sem); + up_read(&job_table_sem); + + switch (type) { + case JOB_ACCT_CSA: + { + struct job_csa *csa = (struct job_csa *)data; + + if (subfield & JOB_CSA_ACCTFILE) { + job->csa.acctfile = csa->job_acctfile; + } + if (subfield & JOB_CSA_COREHIMEM) { + job->csa.corehimem = csa->job_corehimem; + } + if (subfield & JOB_CSA_VIRTHIMEM) { + job->csa.virthimem = csa->job_virthimem; + } + break; + } + default: + up_read(&job->sem); + return -EINVAL; + break; + } + up_read(&job->sem); + return 0; +} + + + +/* + * job_dispatcher - handles job ioctl requests + * @request: The syscall request type + * @data: The syscall request data + * + * Returns 0 on success and -(ERRNO VALUE) upon failure. + */ +int +job_dispatcher(unsigned int request, unsigned long data) +{ + int rc=0; + + switch (request) { + case JOB_CREATE: + rc = job_dispatch_create((struct job_create *)data); + break; + case JOB_ATTACH: + case JOB_DETACH: + /* RESERVED */ + rc = -EBADRQC; + break; + case JOB_GETJID: + rc = job_dispatch_getjid((struct job_getjid *)data); + break; + case JOB_WAITJID: + rc = job_dispatch_waitjid((struct job_waitjid *)data); + break; + case JOB_KILLJID: + rc = job_dispatch_killjid((struct job_killjid *)data); + break; + case JOB_GETJIDCNT: + rc = job_dispatch_getjidcnt((struct job_jidcnt *)data); + break; + case JOB_GETJIDLST: + rc = job_dispatch_getjidlst((struct job_jidlst *)data); + break; + case JOB_GETPIDCNT: + rc = job_dispatch_getpidcnt((struct job_pidcnt *)data); + break; + case JOB_GETPIDLST: + rc = job_dispatch_getpidlst((struct job_pidlst *)data); + break; + case JOB_GETUSER: + rc = job_dispatch_getuser((struct job_user *)data); + break; + case JOB_GETPRIMEPID: + rc = job_dispatch_getprimepid((struct job_primepid *)data); + break; + case JOB_SETHID: + rc = job_dispatch_sethid((struct job_sethid *)data); + break; + case JOB_DETACHJID: + rc = job_dispatch_detachjid((struct job_detachjid *)data); + break; + case JOB_DETACHPID: + rc = job_dispatch_detachpid((struct job_detachpid *)data); + break; + case JOB_ATTACHPID: + rc = job_dispatch_attachpid((struct job_attachpid *)data); + break; + case JOB_SETJLIMIT: + case JOB_GETJLIMIT: + case JOB_GETJUSAGE: + case JOB_FREE: + default: + rc = -EBADRQC; + break; + } + + return rc; +} + + +/* + * job_ioctl - handles job ioctl call requests + * + * + * Returns 0 on success and -(ERRNO VALUE) upon failure. + */ +int +job_ioctl(struct inode *inode, struct file *file, unsigned int request, + unsigned long data) +{ + return job_dispatcher(request, data); +} + + +/* + * init_module + * + * This function is called when a module is inserted into a kernel. This + * function allocates any necessary structures and sets initial values for + * module data. + * + * If the function succeeds, then 0 is returned. On failure, -1 is returned. + */ +static int __init +init_job(void) +{ + int i,rc; + + + /* Initialize the job table chains */ + for (i = 0; i < HASH_SIZE; i++) { + INIT_LIST_HEAD(&job_table[i]); + } + + /* Get hostID string and fill in jid_template hostID segment */ + if (hid) { + jid_hid = (int)simple_strtoul(hid, &hid, 16); + } else { + jid_hid = 0; + } + + rc = pnotify_register(&events); + if (rc < 0) { + return -1; + } + + /* Setup our /proc entry file */ + job_proc_entry = create_proc_entry(JOB_PROC_ENTRY, + S_IFREG | S_IRUGO, &proc_root); + + if (!job_proc_entry) { + pnotify_unregister(&events); + return -1; + } + + job_proc_entry->proc_fops = &job_file_ops; + job_proc_entry->proc_iops = NULL; + + + return 0; +} +module_init(init_job); + +/* + * cleanup_module + * + * This function is called to cleanup after a module when it is removed. + * All memory allocated for this module will be freed. + * + * This function does not take any inputs or produce and output. + */ +static void __exit +cleanup_job(void) +{ + remove_proc_entry(JOB_PROC_ENTRY, &proc_root); + pnotify_unregister(&events); + return; +} +module_exit(cleanup_job); + +EXPORT_SYMBOL(job_register_acct); +EXPORT_SYMBOL(job_unregister_acct); +EXPORT_SYMBOL(job_getjid); +EXPORT_SYMBOL(job_getacct); +EXPORT_SYMBOL(job_setacct); Index: linux/kernel/Makefile =================================================================== --- linux.orig/kernel/Makefile 2005-09-27 17:46:52.056156447 -0500 +++ linux/kernel/Makefile 2005-09-27 17:53:47.970289106 -0500 @@ -21,6 +21,7 @@ obj-$(CONFIG_KEXEC) += kexec.o obj-$(CONFIG_COMPAT) += compat.o obj-$(CONFIG_PNOTIFY) += pnotify.o +obj-$(CONFIG_JOB) += job.o obj-$(CONFIG_CPUSETS) += cpuset.o obj-$(CONFIG_IKCONFIG) += configs.o obj-$(CONFIG_IKCONFIG_PROC) += configs.o Index: linux/include/linux/jobctl.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux/include/linux/jobctl.h 2005-09-27 17:53:47.977124355 -0500 @@ -0,0 +1,185 @@ +/* + * + * Copyright (c) 2000-2005 Silicon Graphics, Inc. All Rights Reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * + * Contact information: Silicon Graphics, Inc., 1500 Crittenden Lane, + * Mountain View, CA 94043, or: + * + * http://www.sgi.com + * + * For further information regarding this notice, see: + * + * http://oss.sgi.com/projects/GenInfo/NoticeExplan + * + * + * Description: This file, include/linux/jobctl.h, contains the data + * definitions used by job to communicate with pnotify via the /proc/job + * ioctl interface. + * + */ + +#ifndef _LINUX_JOBCTL_H +#define _LINUX_JOBCTL_H +#ifndef __KERNEL__ +#include <stdint.h> +#include <sys/types.h> +#include <asm/unistd.h> +#endif + +#define PNOTIFY_NAMELEN 32 /* Max chars in PNOTIFY module name */ +#de... [truncated message content] |
From: Erik J. <er...@sg...> - 2005-09-29 16:58:59
|
This patch implements some parts of the keyring support using the rcu version of pnotify. It may also be illegal because it could sleep with the rcu_read_lock open. See my performance post to follow shortly for a discussion on these things. include/linux/key.h | 21 ++++- include/linux/sched.h | 4 kernel/exit.c | 1 kernel/fork.c | 6 - security/keys/key.c | 24 +++++ security/keys/keyctl.c | 34 +++++++- security/keys/process_keys.c | 173 +++++++++++++++++++++++++++++++++++++------ security/keys/request_key.c | 37 ++++++++- 8 files changed, 257 insertions(+), 43 deletions(-) Index: linux/include/linux/key.h =================================================================== --- linux.orig/include/linux/key.h 2005-09-27 16:01:17.561027702 -0500 +++ linux/include/linux/key.h 2005-09-27 16:05:34.078204518 -0500 @@ -19,6 +19,7 @@ #include <linux/list.h> #include <linux/rbtree.h> #include <linux/rcupdate.h> +#include <linux/pnotify.h> #include <asm/atomic.h> #ifdef __KERNEL__ @@ -262,9 +263,9 @@ extern struct key root_user_keyring, root_session_keyring; extern int alloc_uid_keyring(struct user_struct *user); extern void switch_uid_keyring(struct user_struct *new_user); -extern int copy_keys(unsigned long clone_flags, struct task_struct *tsk); +extern int copy_keys(struct task_struct *tsk, struct pnotify_subscriber *sub, void *olddata); extern int copy_thread_group_keys(struct task_struct *tsk); -extern void exit_keys(struct task_struct *tsk); +extern void exit_keys(struct task_struct *task, struct pnotify_subscriber *sub); extern void exit_thread_group_keys(struct signal_struct *tg); extern int suid_keys(struct task_struct *tsk); extern int exec_keys(struct task_struct *tsk); @@ -279,6 +280,22 @@ old_session; \ }) +/* pnotify subscriber service request */ +static struct pnotify_events key_events = { + .module = NULL, + .name = "key", + .data = NULL, + .entry = LIST_HEAD_INIT(key_events.entry), + .fork = copy_keys, + .exit = exit_keys, +}; + +/* key info associated with the task struct and managed by pnotify */ +struct key_task { + struct key *thread_keyring; /* keyring private to this thread */ + unsigned char jit_keyring; /* default keyring to attach requested keys to */ +}; + #else /* CONFIG_KEYS */ #define key_validate(k) 0 Index: linux/kernel/exit.c =================================================================== --- linux.orig/kernel/exit.c 2005-09-27 16:01:17.562004166 -0500 +++ linux/kernel/exit.c 2005-09-27 16:05:34.082110375 -0500 @@ -857,7 +857,6 @@ exit_namespace(tsk); exit_thread(); cpuset_exit(tsk); - exit_keys(tsk); if (group_dead && tsk->signal->leader) disassociate_ctty(1); Index: linux/kernel/fork.c =================================================================== --- linux.orig/kernel/fork.c 2005-09-27 16:01:17.562004166 -0500 +++ linux/kernel/fork.c 2005-09-27 16:05:34.083086839 -0500 @@ -1009,10 +1009,8 @@ goto bad_fork_cleanup_sighand; if ((retval = copy_mm(clone_flags, p))) goto bad_fork_cleanup_signal; - if ((retval = copy_keys(clone_flags, p))) - goto bad_fork_cleanup_mm; if ((retval = copy_namespace(clone_flags, p))) - goto bad_fork_cleanup_keys; + goto bad_fork_cleanup_mm; retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs); if (retval) goto bad_fork_cleanup_namespace; @@ -1175,8 +1173,6 @@ bad_fork_cleanup_namespace: pnotify_exit(p); exit_namespace(p); -bad_fork_cleanup_keys: - exit_keys(p); bad_fork_cleanup_mm: if (p->mm) mmput(p->mm); Index: linux/security/keys/key.c =================================================================== --- linux.orig/security/keys/key.c 2005-09-27 16:01:17.562980631 -0500 +++ linux/security/keys/key.c 2005-09-27 16:05:34.088945625 -0500 @@ -15,6 +15,7 @@ #include <linux/slab.h> #include <linux/workqueue.h> #include <linux/err.h> +#include <linux/pnotify.h> #include "internal.h" static kmem_cache_t *key_jar; @@ -1009,6 +1010,9 @@ */ void __init key_init(void) { + struct key_task *kt; + struct pnotify_subscriber *sub; + /* allocate a slab in which we can store keys */ key_jar = kmem_cache_create("key_jar", sizeof(struct key), 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL); @@ -1039,4 +1043,24 @@ /* link the two root keyrings together */ key_link(&root_session_keyring, &root_user_keyring); + /* Allocate memory for task assocated key_task structure */ + kt = (struct key_task *)kmalloc(sizeof(struct key_task),GFP_KERNEL); + if (!kt) { + printk(KERN_ERR "Insufficient memory to allocate key_task structure" + " in key_init function.\n"); + return; + } + kt->thread_keyring = NULL; + + /* subscribe this kernel entity to the subscriber list for current task */ + /* Here, there is only one process in existence so we don't do any locking. + */ + sub = pnotify_subscribe(current, &key_events); + if (!sub) { + printk(KERN_ERR "Insufficient memory to add to subscriber list structure" + " in key_init function.\n"); + } + /* Associate the kt structure with this task via pnotify subscriber */ + sub->data = (void *)kt; + } /* end key_init() */ Index: linux/security/keys/process_keys.c =================================================================== --- linux.orig/security/keys/process_keys.c 2005-09-27 16:01:17.562980631 -0500 +++ linux/security/keys/process_keys.c 2005-09-27 16:17:39.829430380 -0500 @@ -16,6 +16,7 @@ #include <linux/keyctl.h> #include <linux/fs.h> #include <linux/err.h> +#include <linux/pnotify.h> #include <asm/uaccess.h> #include "internal.h" @@ -137,6 +138,8 @@ int install_thread_keyring(struct task_struct *tsk) { struct key *keyring, *old; + struct key_task *kt; + struct pnotify_subscriber *sub; char buf[20]; int ret; @@ -149,9 +152,24 @@ } task_lock(tsk); - old = tsk->thread_keyring; - tsk->thread_keyring = keyring; + rcu_read_lock(); + down_write(&tsk->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(tsk, key_events.name); + if (sub == NULL || sub->data == NULL) { /* shouldn't happen */ + printk(KERN_ERR "install_thread_keyring pnotify subscriber or data ptr null, task: %d\n", tsk->pid); + task_unlock(tsk); + up_write(&tsk->pnotify_subscriber_list_sem); + rcu_read_unlock(); + ret = PTR_ERR(sub); + goto error; + } + kt = (struct key_task *)sub->data; + + old = kt->thread_keyring; + kt->thread_keyring = keyring; task_unlock(tsk); + up_write(&tsk->pnotify_subscriber_list_sem); + rcu_read_unlock(); ret = 0; @@ -267,13 +285,27 @@ /* * copy the keys for fork */ -int copy_keys(unsigned long clone_flags, struct task_struct *tsk) +int copy_keys(struct task_struct *tsk, struct pnotify_subscriber *sub, void *olddata) { - key_check(tsk->thread_keyring); + struct key_task *kt = ((struct key_task *)(sub->data)); + + /* Allocate memory for task-associated key_task structure */ + kt = (struct key_task *)kmalloc(sizeof(struct key_task),GFP_KERNEL); + if (!kt) { + printk(KERN_ERR "Insufficient memory to allocate key_task structure" + " in copy_keys function. Task was: %d", tsk->pid); + return PNOTIFY_ERROR; + } + /* Associate key_task structure with the new child via pnotify subscriber */ + /* At this moment, this is an in-construction-task so locking isn't an + * issue */ + sub->data = (void *)kt; + + key_check(kt->thread_keyring); /* no thread keyring yet */ - tsk->thread_keyring = NULL; - return 0; + kt->thread_keyring = NULL; + return PNOTIFY_OK; } /* end copy_keys() */ @@ -292,9 +324,16 @@ /* * dispose of keys upon thread exit */ -void exit_keys(struct task_struct *tsk) +void exit_keys(struct task_struct *task, struct pnotify_subscriber *sub) { - key_put(tsk->thread_keyring); + struct key_task *kt = ((struct key_task *)(sub->data)); + if (kt == NULL) { /* shouldn't ever happen */ + printk(KERN_ERR "exit_keys pnotify subscriber data ptr null, task: %d\n", task->pid); + return; + } + key_put(kt->thread_keyring); + kfree(kt); /* Free pnotify subscriber data for this task */ + sub->data = NULL; } /* end exit_keys() */ @@ -306,12 +345,31 @@ { unsigned long flags; struct key *old; + struct key_task *kt; + struct pnotify_subscriber *sub; - /* newly exec'd tasks don't get a thread keyring */ task_lock(tsk); - old = tsk->thread_keyring; - tsk->thread_keyring = NULL; + /* pnotify doesn't have a compute_creds event at this time, so we + * need to retrieve the data */ + + rcu_read_lock(); + down_write(&tsk->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(tsk, key_events.name); + if (sub == NULL || sub->data == NULL) { /* shouldn't happen */ + printk(KERN_ERR "exec_keys pnotify subscriber or data ptr null, task: %d\n", tsk->pid); + task_unlock(tsk); + up_write(&tsk->pnotify_subscriber_list_sem); + rcu_read_unlock(); + return PNOTIFY_OK; /* key structures not populated yet */ + } + kt = (struct key_task *)sub->data; + + /* newly exec'd tasks don't get a thread keyring */ + old = kt->thread_keyring; + kt->thread_keyring = NULL; task_unlock(tsk); + up_write(&tsk->pnotify_subscriber_list_sem); + rcu_read_unlock(); key_put(old); @@ -344,12 +402,29 @@ */ void key_fsuid_changed(struct task_struct *tsk) { + struct key_task *kt; + struct pnotify_subscriber *sub; + + /* no pnotify event for this, so we need to grab the data */ + rcu_read_lock(); + down_write(&tsk->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(tsk, key_events.name); + if (sub == NULL || sub->data == NULL) { /* shouldn't happen */ + printk(KERN_ERR "key_fsuid_changed pnotify subscriber or data ptr null, task: %d\n", tsk->pid); + up_write(&tsk->pnotify_subscriber_list_sem); + rcu_read_unlock(); + return; + } + kt = (struct key_task *)sub->data; + /* update the ownership of the thread keyring */ - if (tsk->thread_keyring) { - down_write(&tsk->thread_keyring->sem); - tsk->thread_keyring->uid = tsk->fsuid; - up_write(&tsk->thread_keyring->sem); + if (kt->thread_keyring) { + down_write(&kt->thread_keyring->sem); + kt->thread_keyring->uid = tsk->fsuid; + up_write(&kt->thread_keyring->sem); } + up_write(&tsk->pnotify_subscriber_list_sem); + rcu_read_unlock(); } /* end key_fsuid_changed() */ @@ -359,12 +434,29 @@ */ void key_fsgid_changed(struct task_struct *tsk) { + struct key_task *kt; + struct pnotify_subscriber *sub; + + /* pnotify doesn't have an event for this, so we need to grab the data */ + rcu_read_lock(); + down_write(&tsk->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(tsk, key_events.name); + if (sub == NULL || sub->data == NULL) { /* shouldn't happen */ + printk(KERN_ERR "key_fsgid_changed pnotify subscriber or data ptr was null, task: %d\n", tsk->pid); + up_write(&tsk->pnotify_subscriber_list_sem); + rcu_read_unlock(); + return; + } + kt = (struct key_task *)sub->data; + /* update the ownership of the thread keyring */ - if (tsk->thread_keyring) { - down_write(&tsk->thread_keyring->sem); - tsk->thread_keyring->gid = tsk->fsgid; - up_write(&tsk->thread_keyring->sem); + if (kt->thread_keyring) { + down_write(&kt->thread_keyring->sem); + kt->thread_keyring->gid = tsk->fsgid; + up_write(&kt->thread_keyring->sem); } + up_write(&tsk->pnotify_subscriber_list_sem); + rcu_read_unlock(); } /* end key_fsgid_changed() */ @@ -383,6 +475,8 @@ { struct request_key_auth *rka; struct key *key, *ret, *err, *instkey; + struct pnotify_subscriber *sub; + struct key_task *kt; /* we want to return -EAGAIN or -ENOKEY if any of the keyrings were * searchable, but we failed to find a key or we found a negative key; @@ -395,12 +489,26 @@ ret = NULL; err = ERR_PTR(-EAGAIN); + rcu_read_lock(); + down_write(&context->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(context, key_events.name); + if (sub == NULL || sub->data == NULL) { + printk(KERN_ERR "search_process_keyrings pnotify subscriber or data ptr null, task: %d\n", context->pid); + up_write(&context->pnotify_subscriber_list_sem); + rcu_read_unlock(); + return (struct key *)-EFAULT; + } + kt = (struct key_task *)sub->data; + /* search the thread keyring first */ - if (context->thread_keyring) { - key = keyring_search_aux(context->thread_keyring, + if (kt->thread_keyring) { + key = keyring_search_aux(kt->thread_keyring, context, type, description, match); - if (!IS_ERR(key)) + if (!IS_ERR(key)) { + up_write(&context->pnotify_subscriber_list_sem); + rcu_read_unlock(); goto found; + } switch (PTR_ERR(key)) { case -EAGAIN: /* no key */ @@ -414,6 +522,8 @@ break; } } + up_write(&context->pnotify_subscriber_list_sem); + rcu_read_unlock(); /* search the process keyring second */ if (context->signal->process_keyring) { @@ -535,15 +645,28 @@ { struct key *key; int ret; + struct pnotify_subscriber *sub; + struct key_task *kt; if (!context) context = current; key = ERR_PTR(-ENOKEY); + rcu_read_lock(); + down_write(&context->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(context, key_events.name); + if (sub == NULL || sub->data == NULL) { /* shouldn't happen */ + printk(KERN_ERR "search_process_keyrings pnotify subscriber or data ptr null, task: %d\n", context->pid); + up_write(&context->pnotify_subscriber_list_sem); + rcu_read_unlock(); + return (struct key *)-EFAULT; + } + kt = (struct key_task *)sub->data; + switch (id) { case KEY_SPEC_THREAD_KEYRING: - if (!context->thread_keyring) { + if (!kt->thread_keyring) { if (!create) goto error; @@ -554,7 +677,7 @@ } } - key = context->thread_keyring; + key = kt->thread_keyring; atomic_inc(&key->usage); break; @@ -634,6 +757,8 @@ goto invalid_key; error: + up_write(&context->pnotify_subscriber_list_sem); + rcu_read_unlock(); return key; invalid_key: Index: linux/include/linux/sched.h =================================================================== --- linux.orig/include/linux/sched.h 2005-09-27 16:01:17.561027702 -0500 +++ linux/include/linux/sched.h 2005-09-27 16:05:34.093827947 -0500 @@ -718,10 +718,6 @@ kernel_cap_t cap_effective, cap_inheritable, cap_permitted; unsigned keep_capabilities:1; struct user_struct *user; -#ifdef CONFIG_KEYS - struct key *thread_keyring; /* keyring private to this thread */ - unsigned char jit_keyring; /* default keyring to attach requested keys to */ -#endif int oomkilladj; /* OOM kill score adjustment (bit shift). */ char comm[TASK_COMM_LEN]; /* executable name excluding path - access with [gs]et_task_comm (which lock Index: linux/security/keys/keyctl.c =================================================================== --- linux.orig/security/keys/keyctl.c 2005-09-27 16:01:17.563957095 -0500 +++ linux/security/keys/keyctl.c 2005-09-27 16:18:13.257708245 -0500 @@ -931,31 +931,57 @@ long keyctl_set_reqkey_keyring(int reqkey_defl) { int ret; + unsigned char jit_return; + struct pnotify_subscriber *sub; + struct key_task *kt; + + rcu_read_lock(); + down_write(¤t->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(current, key_events.name); + if (sub == NULL || sub->data == NULL) { /* shouldn't happen */ + printk(KERN_ERR "keyctl_set_reqkey_keyring pnotify subscriber or data ptr null, task: %d\n", current->pid); + up_write(¤t->pnotify_subscriber_list_sem); + rcu_read_unlock(); + return -EFAULT; + } + kt = (struct key_task *)sub->data; switch (reqkey_defl) { case KEY_REQKEY_DEFL_THREAD_KEYRING: ret = install_thread_keyring(current); - if (ret < 0) + if (ret < 0) { + up_write(¤t->pnotify_subscriber_list_sem); + rcu_read_unlock(); return ret; + } goto set; case KEY_REQKEY_DEFL_PROCESS_KEYRING: ret = install_process_keyring(current); - if (ret < 0) + if (ret < 0) { + up_write(¤t->pnotify_subscriber_list_sem); + rcu_read_unlock(); return ret; + } case KEY_REQKEY_DEFL_DEFAULT: case KEY_REQKEY_DEFL_SESSION_KEYRING: case KEY_REQKEY_DEFL_USER_KEYRING: case KEY_REQKEY_DEFL_USER_SESSION_KEYRING: set: - current->jit_keyring = reqkey_defl; + + kt->jit_keyring = reqkey_defl; case KEY_REQKEY_DEFL_NO_CHANGE: - return current->jit_keyring; + jit_return = kt->jit_keyring; + up_write(¤t->pnotify_subscriber_list_sem); + rcu_read_unlock(); + return jit_return; case KEY_REQKEY_DEFL_GROUP_KEYRING: default: + up_write(¤t->pnotify_subscriber_list_sem); + rcu_read_unlock(); return -EINVAL; } Index: linux/security/keys/request_key.c =================================================================== --- linux.orig/security/keys/request_key.c 2005-09-27 16:01:17.563957095 -0500 +++ linux/security/keys/request_key.c 2005-09-27 16:18:37.840196240 -0500 @@ -14,6 +14,7 @@ #include <linux/kmod.h> #include <linux/err.h> #include <linux/keyctl.h> +#include <linux/pnotify.h> #include "internal.h" struct key_construction { @@ -39,6 +40,19 @@ char *argv[10], *envp[3], uid_str[12], gid_str[12]; char key_str[12], keyring_str[3][12]; int ret, i; + struct pnotify_subscriber *sub; + struct key_task *kt; + + rcu_read_lock(); + down_write(&tsk->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(current, key_events.name); + if (sub == NULL || sub->data == NULL) { /* shouldn't happen */ + printk(KERN_ERR "call_request_key pnotify subscriber or data ptr null, task: %d\n", tsk->pid); + up_write(&tsk->pnotify_subscriber_list_sem); + rcu_read_unlock(); + return -EFAULT; + } + kt = (struct key_task *)sub->data; kenter("{%d},%s,%s", key->serial, op, callout_info); @@ -58,7 +72,7 @@ /* we specify the process's default keyrings */ sprintf(keyring_str[0], "%d", - tsk->thread_keyring ? tsk->thread_keyring->serial : 0); + kt->thread_keyring ? kt->thread_keyring->serial : 0); prkey = 0; if (tsk->signal->process_keyring) @@ -105,6 +119,8 @@ key_put(session_keyring); error: + up_write(&tsk->pnotify_subscriber_list_sem); + rcu_read_unlock(); kleave(" = %d", ret); return ret; @@ -300,15 +316,28 @@ { struct task_struct *tsk = current; struct key *drop = NULL; + struct pnotify_subscriber *sub; + struct key_task *kt; + + rcu_read_lock(); + down_write(&tsk->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(current, key_events.name); + if (sub == NULL || sub->data == NULL) { /* shouldn't happen */ + printk(KERN_ERR "request_key_link pnotify subscriber or data ptr null, task: %d\n", tsk->pid); + up_write(&tsk->pnotify_subscriber_list_sem); + rcu_read_unlock(); + return; + } + kt = (struct key_task *)sub->data; kenter("{%d},%p", key->serial, dest_keyring); /* find the appropriate keyring */ if (!dest_keyring) { - switch (tsk->jit_keyring) { + switch (kt->jit_keyring) { case KEY_REQKEY_DEFL_DEFAULT: case KEY_REQKEY_DEFL_THREAD_KEYRING: - dest_keyring = tsk->thread_keyring; + dest_keyring = kt->thread_keyring; if (dest_keyring) break; @@ -347,6 +376,8 @@ key_put(drop); kleave(""); + down_write(&tsk->pnotify_subscriber_list_sem); + rcu_read_unlock(); } /* end request_key_link() */ |
From: Erik J. <er...@sg...> - 2005-09-29 17:03:49
|
In my tests, I have found no mesaurable difference in performance between stock 2.6.14-rc2, pnotify with the subscriber list protected by the rwsem lock like it used to be, and pnotify with read protections implemented with RCU. Problem: Job allocates memory and uses rw semaphores quite frequently. This seems to be illegal from within rcu_read_lock() / rcu_read_unlock(). Running with CONFIG_DEBUG_SPINLOCK_SLEEP confirms this. I'm not sure some of the things we want to do in Job are possible (or at least not easily possible) if this kernel module subscriber can never sleep. Even the keyring infrastructure needs to allocate memory sometimes. My opinion is: Let's not use the RCU version of pnotify and stick with the rwsem version. - Since you aren't supposed to sleep, many (most) users can't make use of pnotify. - The performance data shows no difference between the rwsem version with two subscribers (keyring and job) vs a stock kernel with the non-modified keyring support enabled. - Using RCU for protection adds comlexity Here are the details on the test runs. The RCU version "illegally" can sleep but I didn't hit any problems in my tests. IMPORTANT: The version of AIM7 used here is not currently tracking the community version - ours is closer to the original AIM7. General Test info: - For the two tests with pnotify (old rwsem protected subscriber list and new wsem/rcu protected subscriber list), the tests were always fired off with the shell process being in a job container. This means the tests don't only compare RCU performance, but also some aspects of Linux job performance. All child processes will have Job and keyring as subscribers. Of course, the stock kernel doesn't have pnotify or job and uses the non-modified keyring. - forkexit is simple and just forks and exits the number of times supplied on the command line. It reports the number of times fork returned -1. - The 2.6.14-rc2 kernel was used for each test. The only variations in patches and configuration related to which version of keyring, pnotify, and Job were used (if any). - jobtest is a mini job test suite. One of the tests forks/exits enough processes for PIDs to wrap. A job_create is done for each forked process, which means job is always a subscribed kernenl module to the forked processes. I snipped out the exessive output in the results. - keyring support is enabled in all kernels. For both pnotify kernel versions, keyrings have been changed to support pnotify for task struct entries and fork/exit calls. - All kernels include the kdb patch. Otherwise, all kerenels used the sn2_defconfig as a base. - The test system was a 2-processor 1400 Mhz SGI Altix 350 (ia64) pnotify & linux kernel job with RCU, pnotify-aware keyring ------------------------------------------------------------------------------ forkexit test runs (supplied number is the number of forks fired). minime1:/tmp # time ./forkexit 100000 Fork returned an error: 92468 times real 0m0.916s user 0m0.036s sys 0m0.408s minime1:/tmp # time ./forkexit 100000 Fork returned an error: 92469 times real 0m0.914s user 0m0.024s sys 0m0.412s minime1:/tmp # time ./forkexit 100000 Fork returned an error: 92468 times real 0m0.909s user 0m0.060s sys 0m0.332s minime1:/tmp # time ./forkexit 100000 Fork returned an error: 92467 times real 0m0.911s user 0m0.052s sys 0m0.380s minime1:~ # time jobtest [snip] Great. All tests completed, no errors real 0m11.339s user 0m0.524s sys 0m4.536s minime1:~ # time jobtest [snip] Great. All tests completed, no errors real 0m11.385s user 0m0.500s sys 0m4.592s Linux version 2.6.14-rc2 (erikj@attica) (gcc version 3.3.3 (SuSE Linux)) #1 SMP PREEMPT Wed Sep 28 12:11:05 CDT 2005 HOST = minime1 CPUS = 2 DIRS = 1 DISKS= 0 FS = xfs SCSI = non-xscsi ID = RCU pnotify+job Run 1 of 1 Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" AIM7 Run Sep 28 16:00:45 2005 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 1962.2 100 3.0 0.5 32.7040 2 4041.7 99 2.9 0.8 33.6806 3 5734.0 98 3.0 1.2 31.8555 4 7232.1 99 3.2 1.5 30.1336 5 8591.7 96 3.4 1.8 28.6389 10 13655.6 93 4.3 3.4 22.7593 20 19812.8 92 5.9 6.7 16.5106 50 26626.4 90 10.9 16.5 8.8755 100 30021.7 87 19.4 32.8 5.0036 150 31080.9 86 28.1 49.1 3.4534 200 31727.0 84 36.7 65.2 2.6439 500 32047.0 84 90.8 163.1 1.0682 1000 31954.7 83 182.1 325.8 0.5326 2000 32633.4 83 356.7 653.9 0.2719 pnotify & linux kernel job WITHOUT RCU - all rwsem, pnotify-aware keyring ------------------------------------------------------------------------------ forkexit test runs (supplied number is the number of forks fired). minime1:/tmp # time ./forkexit 100000 Fork returned an error: 92468 times real 0m0.913s user 0m0.028s sys 0m0.428s minime1:/tmp # time ./forkexit 100000 Fork returned an error: 92470 times real 0m1.032s user 0m0.044s sys 0m0.416s minime1:/tmp # time ./forkexit 100000 Fork returned an error: 92469 times real 0m1.048s user 0m0.048s sys 0m0.368s minime1:/tmp # time ./forkexit 100000 Fork returned an error: 92469 times real 0m0.948s user 0m0.028s sys 0m0.444s minime1:/tmp # time jobtest [snip] Great. All tests completed, no errors real 0m11.620s user 0m0.612s sys 0m4.540s minime1:/tmp # time jobtest [snip] Great. All tests completed, no errors real 0m11.268s user 0m0.548s sys 0m4.488s Linux version 2.6.14-rc2 (erikj@attica) (gcc version 3.3.3 (SuSE Linux)) #2 SMP PREEMPT Wed Sep 28 15:01:33 CDT 2005 HOST = minime1 CPUS = 2 DIRS = 1 DISKS= 0 FS = xfs SCSI = non-xscsi ID = non-RCU pnotify+job Run 1 of 1 Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" AIM7 Run Sep 28 17:17:56 2005 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 1964.9 100 3.0 0.5 32.7481 2 4020.7 99 2.9 0.8 33.5060 3 5630.4 98 3.1 1.2 31.2802 4 7218.6 97 3.2 1.5 30.0775 5 8637.6 97 3.4 1.8 28.7919 10 13857.1 94 4.2 3.5 23.0952 20 19429.1 93 6.0 6.7 16.1910 50 26126.8 88 11.1 16.5 8.7089 100 29950.6 86 19.4 32.7 4.9918 150 31052.1 86 28.1 49.1 3.4502 200 31817.2 85 36.6 65.4 2.6514 500 32101.1 84 90.7 163.1 1.0700 1000 31908.8 83 182.4 325.8 0.5318 2000 32718.3 82 355.8 653.7 0.2727 linux kernel without pnotify and without job, non-modified keyring enabled ------------------------------------------------------------------------------ forkexit test runs (supplied number is the number of forks fired). minime1:/tmp # time ./forkexit 100000 Fork returned an error: 92467 times real 0m0.910s user 0m0.024s sys 0m0.356s minime1:/tmp # time ./forkexit 100000 Fork returned an error: 92467 times real 0m0.912s user 0m0.032s sys 0m0.472s minime1:/tmp # time ./forkexit 100000 Fork returned an error: 92467 times real 0m0.915s user 0m0.044s sys 0m0.400s minime1:/tmp # time ./forkexit 100000 Fork returned an error: 92469 times real 0m0.923s user 0m0.036s sys 0m0.388s (no jobtest tests since kernel doesn't have pnotify or job) --------------------------------------------------------------------------- Linux version 2.6.14-rc2 (erikj@attica) (gcc version 3.3.3 (SuSE Linux)) #2 SMP PREEMPT Wed Sep 28 14:53:02 CDT 2005 HOST = minime1 CPUS = 2 DIRS = 1 DISKS= 0 FS = xfs SCSI = non-xscsi ID = NO pnotify+job Run 1 of 1 Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" AIM7 Run Sep 28 18:36:09 2005 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 1964.2 100 3.0 0.5 32.7371 2 4024.9 99 2.9 0.8 33.5408 3 5539.3 99 3.2 1.2 30.7741 4 7123.6 96 3.3 1.5 29.6818 5 8617.1 95 3.4 1.8 28.7237 10 13713.5 96 4.2 3.4 22.8558 20 19732.2 92 5.9 6.7 16.4435 50 26411.3 88 11.0 16.4 8.8038 100 30001.5 86 19.4 32.7 5.0003 150 31184.1 86 28.0 49.0 3.4649 200 31714.9 84 36.7 65.4 2.6429 500 32160.7 84 90.5 163.0 1.0720 1000 31876.8 83 182.6 325.7 0.5313 2000 32708.1 83 355.9 653.0 0.2726 |
From: Erik J. <er...@sg...> - 2005-09-30 18:45:15
|
New performance tests. These tests were run on a 32-processor Altix system. It has 1300 MHz CPUs. Note that the individual CPUs are slower than my earlier testing with the 2p box. fork-bomb-like tests are slower on big NUMA machines like this unless you restrict them to running CPUs within the same node (and this is why the 2p box does better in that respect). I know it was requested that I run this on a behemoth - I'm just hoping 32p is big enough. The bigger the machine, the harder it is to reserve time... As seen in the thread, my first pass at an RCU implementation isn't good enough to use for real. I'm hoping it is good enough to get us some numbers. Conclusions: 1. My feeling is that RCU isn't buying us much here. In fact, it restricts what kernel modules can do in that you can't sleep in many interesting places. This affects things such as kmalloc's (unless you say you aren't willing to wait)... It also makes it so you can't use rwsems in all the places you might need to. 2. In the data below, unlike the 2p test, there are some differences. My conclusion here, and the reason I added an additional test, is that the Job infrastructure is the cause for a small slow down. So I added a kernel test where pnotify still has two kernel module subscribers per task, but Job isn't one of them. Job is replaced by a tiny kernel module that just simply counts the number of times the fork, exit, and exec hooks fire. This takes Job out of the equation. That kernel closely tracks the stock kernel. That means pnotify isn't at fault, Job is. In other words, if Job implemented its own hooks and was configured on, we'd probably see the same small dip. 3. The fork/exit tests (jump to the end for a summary) show stock performing a tiny bit better than pnotify with two subscribers and no job (keyrings and a test module). I suspect if I did a test where pnotify only had keyrings as a subscriber, the numbers would be nearly the same. The AIM7 data shows these two kernels as very similar. 4. So I feel pnotify isn't really costing us but the data does show we need to keep an eye on pnotify users. We need to treat new pnotify users in a similar way as new callouts in exit, copy_process, and exec when the pnotify user plans to associate with all or many running tasks (at least for performance reasons) I'm going to integrate other comments I got on the pnotify patch now and send a new version of the non-rcu pnotify patch soon. Some info on the tests: jobtest: a mini job test suite that includes forking enough processes to make the PID numbers roll. Output trimmed. forkexit: just forks and exits the specified number of times. fork-wait-exit: is the same, but the parent waits for the child. AIM7: The version of aim7 we use is not tracking the current community release. All kernels had kdb patches applied and kdb was enabled. 4 kernels tested: 2.6.14-rc2 with pnotify, job, and pnotify-aware keyrings, original NON-RCU 2.6.14-rc2 with pnotify, job, and pnotify-aware keyrings, RCU Testing 2.6.14-rc2 stock (no pnotify, no job, non-modified keyrings enabled) 2.6.14-rc2 non-rcu pnotify, NO Job, pnotify-aware keyrings, tiny pnotify module In the last kernel, I'm showing a pnotify user that only does atomic adds and subtracts on a few variables and nothing else. The test module, like keyrings, is a subscriber to all processes. The purpose of this test is to take the Job infrastructure out of the performance picture to just focus on pnotify. This tiny test module counts the number of times fork, exit, and exec fires. Output is provided. 2.6.14-rc2 pnotify, job, pnotify-aware keyrings implementation NOT using rcu ------------------------------------------------------------------------------ === jobtest === belay:~ # time jobtest [snip] Great. All tests completed, no errors real 0m26.222s user 0m1.736s sys 0m21.380s belay:~ # time jobtest [snip] Great. All tests completed, no errors real 0m35.574s user 0m2.512s sys 0m31.356s belay:~ # time jobtest [snip] Great. All tests completed, no errors real 0m35.696s user 0m2.556s sys 0m31.128s === forkexit === belay:~ # time ./forkexit 40000 Fork returned an error: 7720 times real 0m14.665s user 0m0.072s sys 0m14.464s belay:~ # time ./forkexit 40000 Fork returned an error: 7720 times real 0m15.439s user 0m0.104s sys 0m15.212s belay:~ # time ./forkexit 40000 Fork returned an error: 7720 times real 0m15.115s user 0m0.068s sys 0m14.924s === fork-wait-exit === belay:~ # time ./fork-wait-exit 40000 Fork returned an error: 0 times real 0m25.494s user 0m2.136s sys 0m31.588s belay:~ # time ./fork-wait-exit 40000 Fork returned an error: 0 times real 0m25.208s user 0m1.760s sys 0m28.208s === AIM7 === Linux version 2.6.14-rc2 (erikj@attica) (gcc version 3.3.3 (SuSE Linux)) #2 SMP PREEMPT Wed Sep 28 15:01:33 CDT 2005 HOST = belay CPUS = 32 DIRS = 1 DISKS= 0 FS = xfs SCSI = non-xscsi ID = pnotify+job+new-keyring, NON-RCU Run 1 of 1 Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" AIM7 Run Sep 29 16:03:37 2005 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 1907.6 100 3.1 0.6 31.7929 2 3694.1 96 3.2 1.2 30.7839 3 5643.2 96 3.1 1.5 31.3510 4 7994.5 99 2.9 1.8 33.3104 5 9417.5 97 3.1 2.4 31.3916 10 18976.2 97 3.1 4.7 31.6270 20 37439.7 98 3.1 9.1 31.1997 50 74348.5 94 3.9 23.6 24.7828 100 101694.9 90 5.7 46.5 16.9492 150 124643.1 87 7.0 67.8 13.8492 200 130669.1 86 8.9 90.3 10.8891 500 154877.9 83 18.8 227.9 5.1626 1000 150566.6 80 38.7 453.3 2.5094 2000 154665.9 78 75.3 937.6 1.2889 2.6.14-rc2 pnotify, job, pnotify-aware keyrings implementation with RCU ------------------------------------------------------------------------------ === jobtest === belay:~ # time jobtest [snip] Great. All tests completed, no errors real 0m26.501s user 0m1.608s sys 0m20.596s belay:~ # time jobtest [snip] Great. All tests completed, no errors real 0m35.501s user 0m2.320s sys 0m30.148s belay:~ # time jobtest [snip] Great. All tests completed, no errors real 0m35.563s user 0m2.576s sys 0m30.632s === forkexit === --> Note: The first attempt took 1 minutes 23 seconds. This only --> happened once. With the stock kernel test below, one attempt was --> an outlier taking more than 5 minutes. Therefore, I don't think it --> is anything I changed that caused this. belay:~ # time ./forkexit 40000 Fork returned an error: 7701 times real 1m22.814s user 0m0.064s sys 1m22.740s belay:~ # time ./forkexit 40000 Fork returned an error: 7693 times real 0m14.557s user 0m0.076s sys 0m14.368s belay:~ # time ./forkexit 40000 Fork returned an error: 7693 times real 0m14.774s user 0m0.076s sys 0m14.580s belay:~ # time ./forkexit 40000 Fork returned an error: 7693 times real 0m15.218s user 0m0.112s sys 0m14.992s === fork-wait-exit === belay:~ # time ./fork-wait-exit 40000 Fork returned an error: 0 times real 0m24.922s user 0m1.560s sys 0m28.060s belay:~ # time ./fork-wait-exit 40000 Fork returned an error: 0 times real 0m25.178s user 0m1.680s sys 0m27.988s === AIM7 === nux version 2.6.14-rc2 (erikj@attica) (gcc version 3.3.3 (SuSE Linux)) #5 SMP PREEMPT Thu Sep 29 15:00:00 CDT 2005 HOST = belay CPUS = 32 DIRS = 1 DISKS= 0 FS = xfs SCSI = non-xscsi ID = RCU pnotify+job+new-keyring Run 1 of 1 Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" AIM7 Run Sep 29 16:42:08 2005 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 1903.8 100 3.1 0.6 31.7305 2 3655.8 96 3.2 1.2 30.4648 3 5628.6 96 3.1 1.6 31.2701 4 7553.5 96 3.1 2.1 31.4731 5 10003.4 99 2.9 2.2 33.3448 10 18945.3 97 3.1 4.7 31.5755 20 36823.8 97 3.2 9.2 30.6865 50 72532.4 94 4.0 23.4 24.1775 100 97552.8 90 6.0 46.7 16.2588 150 120981.2 87 7.2 68.4 13.4424 200 130581.1 86 8.9 88.7 10.8818 500 151665.2 83 19.2 228.0 5.0555 1000 153586.3 81 37.9 467.5 2.5598 2000 153454.7 79 75.9 941.7 1.2788 2.6.14-rc2 stock+kdb, no pnotify, no job, unmodified keyrings enabled ------------------------------------------------------------------------------ === jobtest === None: kernel doesn't have pnotify or job === forkexit === --> My first attempt on a stock kernel+kdb took more than 5 minutes. The --> data isn't useful because I escaped in to kdb to check on some things. --> There was a similar outlier for my RCU pnotify kernel above that took --> 1 minute 23 above. Later runs were normal. belay:~ # time ./forkexit 40000 Fork returned an error: 7698 times real 0m14.421s user 0m0.088s sys 0m14.220s belay:~ # time ./forkexit 40000 Fork returned an error: 7699 times real 0m14.282s user 0m0.064s sys 0m14.100s belay:~ # time ./forkexit 40000 Fork returned an error: 7699 times real 0m15.736s user 0m0.072s sys 0m15.648s === fork-wait-exit === --> The 16.838 was an outlier I was never able to duplicate. --> I tried many times. Of course, if I resrict it to just two processors --> on a single node, it's faster. Perhaps we got lucky once. belay:~ # time ./fork-wait-exit 40000 Fork returned an error: 0 times real 0m16.838s user 0m1.080s sys 0m17.092s belay:~ # time ./fork-wait-exit 40000 Fork returned an error: 0 times real 0m24.599s user 0m1.728s sys 0m28.248s belay:~ # time ./fork-wait-exit 40000 Fork returned an error: 0 times real 0m24.742s user 0m1.668s sys 0m27.332s belay:~ # time ./fork-wait-exit 40000 Fork returned an error: 0 times real 0m24.777s user 0m1.720s sys 0m26.548s belay:~ # time ./fork-wait-exit 40000 Fork returned an error: 0 times real 0m24.829s user 0m1.644s sys 0m27.168s === AIM7 === nux version 2.6.14-rc2 (erikj@attica) (gcc version 3.3.3 (SuSE Linux)) #2 SMP PREEMPT Wed Sep 28 14:53:02 CDT 2005 HOST = belay CPUS = 32 DIRS = 1 DISKS= 0 FS = xfs SCSI = non-xscsi ID = stock, non-modified-keyrings enabled Run 1 of 1 Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" AIM7 Run Sep 29 17:38:42 2005 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 1909.4 100 3.0 0.6 31.8241 2 3670.8 96 3.2 1.2 30.5897 3 5597.9 96 3.1 1.6 31.0997 4 7543.7 96 3.1 2.1 31.4323 5 9445.0 97 3.1 2.4 31.4833 10 19088.2 97 3.0 4.5 31.8137 20 37977.2 98 3.1 9.0 31.6476 50 73559.2 95 4.0 23.6 24.5197 100 101147.0 90 5.8 47.1 16.8578 150 120496.9 86 7.2 68.5 13.3885 200 129003.7 86 9.0 90.5 10.7503 500 149737.6 83 19.4 222.8 4.9913 1000 158010.5 80 36.8 439.7 2.6335 2000 156453.7 77 74.4 885.3 1.3038 2.6.14-rc2 non-rcu pnotify, NO Job, pnotify-aware keyrings, tiny pnotify module ------------------------------------------------------------------------------ === jobtest === None: kernel doesn't have job === forkexit === --> Like the other tests, the first run of this took longer than the other --> runs of it. belay:~ # time ./forkexit 40000 Fork returned an error: 7686 times real 1m35.260s user 0m0.076s sys 1m35.064s belay:~ # time ./forkexit 40000 Fork returned an error: 7686 times real 0m15.843s user 0m0.068s sys 0m15.652s belay:~ # time ./forkexit 40000 Fork returned an error: 7687 times real 0m14.404s user 0m0.064s sys 0m14.220s belay:~ # time ./forkexit 40000 Fork returned an error: 7687 times real 0m14.487s user 0m0.060s sys 0m14.304s === fork-wait-exit === belay:~ # time ./fork-wait-exit 40000 Fork returned an error: 0 times real 0m25.205s user 0m2.112s sys 0m30.684s belay:~ # time ./fork-wait-exit 40000 Fork returned an error: 0 times real 0m25.045s user 0m1.872s sys 0m30.208s belay:~ # time ./fork-wait-exit 40000 Fork returned an error: 0 times real 0m24.927s user 0m1.964s sys 0m29.200s === AIM7 === Linux version 2.6.14-rc2 (erikj@attica) (gcc version 3.3.3 (SuSE Linux)) #3 SMP PREEMPT Fri Sep 30 09:21:15 CDT 2005 HOST = belay CPUS = 32 DIRS = 1 DISKS= 0 FS = xfs SCSI = non-xscsi ID = non-rcu pnotify, NO Job, pnotify keyrings, tiny pnotify test module Run 1 of 1 Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" AIM7 Run Sep 30 10:19:29 2005 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 1909.4 100 3.0 0.6 31.8241 2 3694.1 96 3.2 1.2 30.7839 3 5679.9 97 3.1 1.5 31.5550 4 7538.9 96 3.1 2.2 31.4119 5 10031.0 99 2.9 2.2 33.4367 10 19088.2 97 3.0 4.5 31.8137 20 36964.1 97 3.1 9.2 30.8034 50 72424.1 95 4.0 23.9 24.1414 100 102410.7 90 5.7 46.1 17.0684 150 120148.6 86 7.3 67.4 13.3498 200 131927.9 87 8.8 89.5 10.9940 500 154229.4 83 18.9 220.7 5.1410 1000 158552.9 80 36.7 440.5 2.6425 2000 152884.3 78 76.1 882.8 1.2740 === Special note for this kernel === After all the testing, I removed my tiny test kernel module that did atomic increments to count the number of times various hooks happened. Here is the info from that (from dmesg): Unregistering pnotify support for (name=pnotify-test) exit called 797136 times... fork called 796785 times... init called 351 times... exec called 97675 times ... Good - fork count + init count equals exit count. ------------------------------------------------------------------------------ I really would need more trials to see if these converge but I think we're covered on data with the AIM runs. Let me know if more data is requested. It appears stock is performing best but pnotify with two subscribers (keyrings, which is present in stock and my test module that atomic increments counters for the callbacks) is very close. I suspect if I ran another test where only one subscriber was present to match what was in stock, the numbers would be nearly the same. forkexit summary (real time average minus outliers): pnotify+job+pnotify keyrings, NON-RCU: 15.07 pnotify+job+pnotify keyrings, RCU: 14.85 stock: 14.813 pnotify+pnotify keyrings+test mod, non-rcu: 14.911 fork-wait-exit summary (real time average minus outlier): pnotify+job+pnotify keyrings, NON-RCU: 25.351 pnotify+job+pnotify keyrings, RCU: 25.05 stock: 24.703 pnotify+pnotify keyrings+test mod, non-rcu: 25.059 |
From: Erik J. <er...@sg...> - 2005-10-02 20:23:47
|
Based on feedback obtained so far, I have posted a number of different versions of a few patches. In the performance discussion, I show that I don't belive RCU is faster by showing the numbers for RCU based job are similar to those of original rwsem version. The grain of salt there is that my RCU implementation wasn't complete as my Job patch couldn't fully use it within the restrictions of 'no sleep during rcu_read_locks'. The test runs were done on a 2p and 32p ia64 box - see the performance subthread of this thread. I implemented two versions of a keyring patch showing how an existing piece of the kernel may make use of pnotify - one for the RCU version and one for the original version. Because my feeling, after doing this research, is that the RCU version of pnotify may not be the best fit, I went ahead and implemented the other feedback gathered from this list, the pagg mailing list, and a co-worker. Here is a cleaned up pnotify patch that is _not_ using RCU. Signed-off-by: Erik Jacobson <er...@sg...> --- Documentation/pnotify.txt | 368 +++++++++++++++++++++++++++++ fs/exec.c | 2 include/linux/init_task.h | 2 include/linux/pnotify.h | 227 ++++++++++++++++++ include/linux/sched.h | 5 init/Kconfig | 8 kernel/Makefile | 1 kernel/exit.c | 4 kernel/fork.c | 17 + kernel/pnotify.c | 568 ++++++++++++++++++++++++++++++++++++++++++++++ 10 files changed, 1201 insertions(+), 1 deletion(-) Index: linux/fs/exec.c =================================================================== --- linux.orig/fs/exec.c 2005-09-30 14:57:55.097213456 -0500 +++ linux/fs/exec.c 2005-09-30 14:57:57.629184199 -0500 @@ -48,6 +48,7 @@ #include <linux/syscalls.h> #include <linux/rmap.h> #include <linux/acct.h> +#include <linux/pnotify.h> #include <asm/uaccess.h> #include <asm/mmu_context.h> @@ -1203,6 +1204,7 @@ retval = search_binary_handler(bprm,regs); if (retval >= 0) { free_arg_pages(bprm); + pnotify_exec(current); /* execve success */ security_bprm_free(bprm); Index: linux/include/linux/init_task.h =================================================================== --- linux.orig/include/linux/init_task.h 2005-09-30 14:57:55.098189920 -0500 +++ linux/include/linux/init_task.h 2005-09-30 14:57:57.636019445 -0500 @@ -2,6 +2,7 @@ #define _LINUX__INIT_TASK_H #include <linux/file.h> +#include <linux/pnotify.h> #include <linux/rcupdate.h> #define INIT_FDTABLE \ @@ -121,6 +122,7 @@ .proc_lock = SPIN_LOCK_UNLOCKED, \ .journal_info = NULL, \ .cpu_timers = INIT_CPU_TIMERS(tsk.cpu_timers), \ + INIT_TASK_PNOTIFY(tsk) \ .fs_excl = ATOMIC_INIT(0), \ } Index: linux/include/linux/sched.h =================================================================== --- linux.orig/include/linux/sched.h 2005-09-30 14:57:55.098189920 -0500 +++ linux/include/linux/sched.h 2005-09-30 15:25:34.616252251 -0500 @@ -795,6 +795,11 @@ struct mempolicy *mempolicy; short il_next; #endif +#ifdef CONFIG_PNOTIFY +/* List of pnotify kernel module subscribers */ + struct list_head pnotify_subscriber_list; + struct rw_semaphore pnotify_subscriber_list_sem; +#endif #ifdef CONFIG_CPUSETS struct cpuset *cpuset; nodemask_t mems_allowed; Index: linux/init/Kconfig =================================================================== --- linux.orig/init/Kconfig 2005-09-30 14:57:55.099166384 -0500 +++ linux/init/Kconfig 2005-09-30 15:25:34.489311959 -0500 @@ -162,6 +162,14 @@ for processing it. A preliminary version of these tools is available at <http://www.physik3.uni-rostock.de/tim/kernel/utils/acct/>. +config PNOTIFY + bool "Support for Process Notification" + help + Say Y here if you will be loading modules which provide support + for process notification. Examples of such modules include the + Linux Jobs module and the Linux Array Sessions module. If you will not + be using such modules, say N. + config SYSCTL bool "Sysctl support" ---help--- Index: linux/kernel/Makefile =================================================================== --- linux.orig/kernel/Makefile 2005-09-30 14:57:55.100142848 -0500 +++ linux/kernel/Makefile 2005-09-30 15:25:34.490288423 -0500 @@ -20,6 +20,7 @@ obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o obj-$(CONFIG_KEXEC) += kexec.o obj-$(CONFIG_COMPAT) += compat.o +obj-$(CONFIG_PNOTIFY) += pnotify.o obj-$(CONFIG_CPUSETS) += cpuset.o obj-$(CONFIG_IKCONFIG) += configs.o obj-$(CONFIG_IKCONFIG_PROC) += configs.o Index: linux/kernel/fork.c =================================================================== --- linux.orig/kernel/fork.c 2005-09-30 14:57:55.100142848 -0500 +++ linux/kernel/fork.c 2005-09-30 15:54:50.502255817 -0500 @@ -42,6 +42,7 @@ #include <linux/profile.h> #include <linux/rmap.h> #include <linux/acct.h> +#include <linux/pnotify.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> @@ -151,6 +152,9 @@ init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2; init_task.signal->rlim[RLIMIT_SIGPENDING] = init_task.signal->rlim[RLIMIT_NPROC]; + + /* Initialize the pnotify list in pid 0 before it can clone itself. */ + INIT_PNOTIFY_LIST(current); } static struct task_struct *dup_task_struct(struct task_struct *orig) @@ -1039,6 +1043,15 @@ p->exit_state = 0; /* + * Call pnotify kernel module subscribers and add the same subscribers the + * parent has to the new process. + * Fail the fork on error. + */ + retval = pnotify_fork(p, current); + if (retval) + goto bad_fork_cleanup_namespace; + + /* * Ok, make it visible to the rest of the system. * We dont wake it up yet. */ @@ -1073,7 +1086,7 @@ if (sigismember(¤t->pending.signal, SIGKILL)) { write_unlock_irq(&tasklist_lock); retval = -EINTR; - goto bad_fork_cleanup_namespace; + goto bad_fork_cleanup_pnotify; } /* CLONE_PARENT re-uses the old parent */ @@ -1159,6 +1172,8 @@ return ERR_PTR(retval); return p; +bad_fork_cleanup_pnotify: + pnotify_exit(p); bad_fork_cleanup_namespace: exit_namespace(p); bad_fork_cleanup_keys: Index: linux/kernel/exit.c =================================================================== --- linux.orig/kernel/exit.c 2005-09-30 14:57:55.100142848 -0500 +++ linux/kernel/exit.c 2005-09-30 15:25:34.617228715 -0500 @@ -29,6 +29,7 @@ #include <linux/proc_fs.h> #include <linux/mempolicy.h> #include <linux/cpuset.h> +#include <linux/pnotify.h> #include <linux/syscalls.h> #include <linux/signal.h> @@ -866,6 +867,9 @@ module_put(tsk->binfmt->module); tsk->exit_code = code; + + pnotify_exit(tsk); + exit_notify(tsk); #ifdef CONFIG_NUMA mpol_free(tsk->mempolicy); Index: linux/kernel/pnotify.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux/kernel/pnotify.c 2005-09-30 16:01:50.518408112 -0500 @@ -0,0 +1,568 @@ +/* + * Process Notification (pnotify) interface + * + * + * Copyright (c) 2000-2005 Silicon Graphics, Inc. All Rights Reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Contact information: Silicon Graphics, Inc., 1500 Crittenden Lane, + * Mountain View, CA 94043, or: + * + * http://www.sgi.com + */ + +#include <linux/config.h> +#include <linux/module.h> +#include <linux/pnotify.h> +#include <linux/sched.h> +#include <linux/slab.h> +#include <asm/semaphore.h> + +/* list of pnotify event list entries that reference the "module" + * implementations */ +static LIST_HEAD(pnotify_event_list); +static DECLARE_RWSEM(pnotify_event_list_sem); + + +/** + * pnotify_get_subscriber - get a pnotify subscriber given a search key + * @task: We examine the pnotify_subscriber_list from the given task + * @key: Key name of kernel module subscriber we wish to retrieve + * + * Given a pnotify_subscriber_list structure, this function will return + * a pointer to the kernel module pnotify_subsciber struct that matches the + * search key. If the key is not found, the function will return NULL. + * + * Locking: This is a pnotify_subscriber_list reader. This function should + * be called with at least a read lock on the pnotify_subscriber_list using + * down_read(&task->pnotify_subscriber_list_sem). + * + */ +struct pnotify_subscriber * +pnotify_get_subscriber(struct task_struct *task, char *key) +{ + struct pnotify_subscriber *subscriber; + + list_for_each_entry(subscriber, &task->pnotify_subscriber_list, entry) { + if (!strcmp(subscriber->events->name,key)) + return subscriber; + } + return NULL; +} + + +/** + * pnotify_subscribe - Add kernel module to the subscriber list for process + * @task: Task that gets the new kernel module subscriber added to the list + * @events: pnotify_events structure to associate with kernel module + * + * Given a task and a pnotify_events structure, this function will allocate + * a new pnotify_subscriber, initialize the settings, and insert it into + * the pnotify_subscriber_list for the task. + * + * Locking: + * The caller for this function should hold at least a read lock on the + * pnotify_event_list_sem - or ensure that the pnotify_events entry cannot be + * removed. If this function was called from the pnotify module (usually the + * case), then the caller need not hold this lock because the event + * structure won't disappear until pnotify_unregister is called. + * + * This is a pnotify_subscriber_list WRITER. The caller must hold a write + * lock on for the tasks pnotify_subscriber_list_sem. This can be locked + * using down_write(&task->pnotify_subscriber_list_sem). + */ +struct pnotify_subscriber * +pnotify_subscribe(struct task_struct *task, struct pnotify_events *events) +{ + struct pnotify_subscriber *subscriber; + + subscriber = kmalloc(sizeof(struct pnotify_subscriber), GFP_KERNEL); + if (!subscriber) + return NULL; + + subscriber->events = events; + subscriber->data = NULL; + atomic_inc(&events->refcnt); /* Increase hook's reference count */ + list_add_tail(&subscriber->entry, &task->pnotify_subscriber_list); + return subscriber; +} + + +/** + * pnotify_unsubscribe - Remove kernel module subscriber from process + * @subscriber: The subscriber to remove + * + * This function will ensure the subscriber is deleted form + * the list of subscribers for the task. Finally, the memory for the + * subscriber is discarded. + * + * Prior to calling pnotify_unsubscribe, the subscriber should have been + * detached from any uses the kernel module may have. This is often done using + * p->events->exit(task, subscriber); + * + * Locking: + * This is a pnotify_subscriber_list WRITER. The caller of this function must + * hold a write lock on the pnotify_subscriber_list_sem for the task. This can + * be locked using down_write(&task->pnotify_subscriber_list_sem). Because + * events are referenced, the caller should ensure the events structure + * doesn't disappear. If the caller is a pnotify module, the events + * structure won't disappear until pnotify_unregister is called so it's safe + * not to lock the pnotify_event_list_sem. + * + * + */ +void +pnotify_unsubscribe(struct pnotify_subscriber *subscriber) +{ + atomic_dec(&subscriber->events->refcnt); /* dec events ref count */ + list_del(&subscriber->entry); + kfree(subscriber); +} + + +/** + * pnotify_get_events - Get the pnotify_events struct matching requested name + * @key: The name of the events structure to get + * + * Given a pnotify_events struct name that represents the kernel module name, + * this function will return a pointer to the pnotify_events structure that + * matches the name. + * + * Locking: + * You should hold either the write or read lock for pnotify_event_list_sem + * before using this function. This will ensure that the pnotify_event_list + * does not change while iterating through the list entries. + * + */ +static struct pnotify_events * +pnotify_get_events(char *key) +{ + struct pnotify_events *events; + + list_for_each_entry(events, &pnotify_event_list, entry) { + if (!strcmp(events->name, key)) { + return events; + } + } + return NULL; +} + +/** + * remove_subscriber_from_all_tasks - Remove subscribers for given events struct + * @events: pnotify_events struct for subscribers to remove + * + * Given a kernel module events struct registered with pnotify, + * this function will remove all subscribers matching the events struct from + * all tasks. + * + * If there is a exit function associated with the subscriber, it is called + * before the subscriber is unsubscribed/freed. + * + * This is meant to be used by pnotify_register and pnotify_unregister + * + * Locking: This is a pnotify_subscriber_list WRITER and this function + * handles locking of the pnotify_subscriber_list_sem so callers don't + * need to. + * + */ +static void +remove_subscriber_from_all_tasks(struct pnotify_events *events) +{ + if (events == NULL) + return; + + /* Because of internal race conditions we can't guarantee + * getting every task in just one pass so we just keep going + * until there are no tasks with subscribers from this events struct + * attached. The inefficiency of this should be tempered by the fact + * that this happens at most once for each registered client. + */ + while (atomic_read(&events->refcnt) != 0) { + struct task_struct *g = NULL, *p = NULL; + + read_lock(&tasklist_lock); + do_each_thread(g, p) { + struct pnotify_subscriber *subscriber; + int task_exited; + + get_task_struct(p); + read_unlock(&tasklist_lock); + down_write(&p->pnotify_subscriber_list_sem); + subscriber = pnotify_get_subscriber(p, events->name); + if (subscriber != NULL) { + (void)events->exit(p, subscriber); + pnotify_unsubscribe(subscriber); + } + up_write(&p->pnotify_subscriber_list_sem); + read_lock(&tasklist_lock); + /* If a task exited while we were looping, its sibling + * list would be empty. In that case, we jump out of + * the do_each_thread and loop again in the outter + * while because the reference count probably isn't + * zero for the pnotify events yet. Doing it this way + * makes it so we don't hold the tasklist lock too + * long. + */ + + + task_exited = list_empty(&p->sibling); + put_task_struct(p); + if (task_exited) + goto endloop; + } while_each_thread(g, p); + endloop: + read_unlock(&tasklist_lock); + } +} + +/** + * pnotify_register - Register a new module subscriber and enter it in the list + * @events_new: The new pnotify events structure to register. + * + * Used to register a new module subscriber pnotify_events structure and enter + * it into the pnotify_event_list. The service name for a pnotify_events + * struct is restricted to 32 characters. + * + * If an "init()" function is supplied in the events struct being registered + * then the kernel module will be subscribed to all existing tasks and the + * supplied "init()" function will be applied to it. If any call to the + * supplied "init()" function returns a non zero result, the registration will + * be aborted. As part of the abort process, all subscribers belonging to the + * new client will be removed from all tasks and the supplied "detach()" + * function will be called on them. + * + * If a memory error is encountered, the module (pnotify_events structure) + * is unregistered and any tasks we became subscribed to are detached. + * + * Locking: This function is an event list writer as well as a + * pnotify_subscriber_list writer. This function does the locks itself. + * Callers don't need to. + * + */ +int +pnotify_register(struct pnotify_events *events_new) +{ + struct pnotify_events *events = NULL; + + /* Add new pnotify module to access list */ + if (!events_new) + return -EINVAL; /* error */ + if (!list_empty(&events_new->entry)) + return -EINVAL; /* error */ + if (events_new->name == NULL || strlen(events_new->name) > + PNOTIFY_NAMELN) + return -EINVAL; /* error */ + if (!events_new->fork || !events_new->exit) + return -EINVAL; /* error */ + + /* Try to insert new events entry into the events list */ + down_write(&pnotify_event_list_sem); + + events = pnotify_get_events(events_new->name); + + if (events) { + up_write(&pnotify_event_list_sem); + printk(KERN_WARNING "Attempt to register duplicate" + " pnotify support (name=%s)\n", events_new->name); + return -EBUSY; + } + + /* Okay, we can insert into the events list */ + list_add_tail(&events_new->entry, &pnotify_event_list); + /* set the ref count to zero */ + atomic_set(&events_new->refcnt, 0); + + /* Now we can call the init function (if present) for each task */ + if (events_new->init != NULL) { + struct task_struct *g = NULL, *p = NULL; + int init_result = 0; + + /* Because of internal race conditions we can't guarantee + * getting every task in just one pass so we just keep going + * until we don't find any unitialized tasks. The inefficiency + * of this should be tempered by the fact that this happens + * at most once for each registered client. + */ + read_lock(&tasklist_lock); + repeat: + do_each_thread(g, p) { + struct pnotify_subscriber *subscriber; + int task_exited; + + get_task_struct(p); + read_unlock(&tasklist_lock); + down_write(&p->pnotify_subscriber_list_sem); + subscriber = pnotify_get_subscriber(p, + events_new->name); + if (!subscriber && !(p->flags & PF_EXITING)) { + subscriber = pnotify_subscribe(p, events_new); + if (subscriber != NULL) { + init_result = events_new->init(p, + subscriber); + + /* Success, but init function pointer + * doesn't want kernel module on the + * subscriber list. */ + if (init_result > 0) { + pnotify_unsubscribe(subscriber); + } + } + else { + init_result = -ENOMEM; + } + } + up_write(&p->pnotify_subscriber_list_sem); + read_lock(&tasklist_lock); + /* Like in remove_subscriber_from_all_tasks, if the + * task disappeared on us while we were going through + * the for_each_thread loop, we need to start over + * with that loop. That's why we have the list_empty + * here */ + task_exited = list_empty(&p->sibling); + put_task_struct(p); + if (init_result < 0) + goto endloop; + if (task_exited) + goto repeat; + } while_each_thread(g, p); + endloop: + read_unlock(&tasklist_lock); + + /* + * if anything went wrong during initialisation abandon the + * registration process + */ + if (init_result < 0) { + remove_subscriber_from_all_tasks(events_new); + list_del_init(&events_new->entry); + up_write(&pnotify_event_list_sem); + + printk(KERN_WARNING "Registering pnotify support for" + " (name=%s) failed\n", events_new->name); + + return init_result; /* init function error result */ + } + } + + up_write(&pnotify_event_list_sem); + + printk(KERN_INFO "Registering pnotify support for (name=%s)\n", + events_new->name); + + return 0; /* success */ + +} + +/** + * pnotify_unregister - Unregister kernel module/pnotify_event struct + * @event_old: pnotify_event struct for the kernel module we're unregistering + * + * Used to unregister kernel module subscribers indicated by + * pnotify_events struct. Removes them from the list of kernel modules + * in pnotify_event_list. + * + * Once the events entry in the pnotify_event_list is found, subscribers for + * this kernel module have their exit functions called and will then be + * removed from the list. + * + * Locking: This functoin is a pnotify_event_list writer. It also calls + * remove_subscriber_from_all_tasks, which is a pnotify_subscriber_list + * writer. Callers don't need to hold these locks ahead of calling this + * function. + * + */ +int +pnotify_unregister(struct pnotify_events *events_old) +{ + struct pnotify_events *events; + + /* Check the validity of the arguments */ + if (!events_old) + return -EINVAL; /* error */ + if (list_empty(&events_old->entry)) + return -EINVAL; /* error */ + if (events_old->name == NULL) + return -EINVAL; /* error */ + + down_write(&pnotify_event_list_sem); + + events = pnotify_get_events(events_old->name); + + if (events && events == events_old) { + remove_subscriber_from_all_tasks(events); + list_del_init(&events->entry); + up_write(&pnotify_event_list_sem); + + printk(KERN_INFO "Unregistering pnotify support for" + " (name=%s)\n", events_old->name); + + return 0; /* success */ + } + + up_write(&pnotify_event_list_sem); + + printk(KERN_WARNING "Attempt to unregister pnotify support (name=%s)" + " failed - not found\n", events_old->name); + + return -EINVAL; /* error */ +} + + +/** + * __pnotify_fork - Add kernel module subscribe to same subscribers as parent + * @to_task: The child task that will inherit the parent's subscribers + * @from_task: The parent task + * + * Make it so a new task being constructed has the same kernel module + * subscribers of its parent. + * + * The "from" argument is the parent task. The "to" argument is the child + * task. + * + * See Documentation/pnotify.txt * for details on + * how to handle return codes from the attach function pointer. + * + * Locking: The to_task is currently in-construction, so we don't + * need to worry about write-locks. We do need to be sure the parent's + * subscriber list, which we copy here, doesn't go away on us. This function + * read-locks the pnotify_subscriber_list. Callers don't need to lock. + * + */ +int +__pnotify_fork(struct task_struct *to_task, struct task_struct *from_task) +{ + struct pnotify_subscriber *from_subscriber; + int ret; + + /* lock the parents subscriber list we are copying from */ + down_read(&from_task->pnotify_subscriber_list_sem); + + list_for_each_entry(from_subscriber, + &from_task->pnotify_subscriber_list, entry) { + struct pnotify_subscriber *to_subscriber = NULL; + + to_subscriber = pnotify_subscribe(to_task, + from_subscriber->events); + if (!to_subscriber) { + /* Failed to get memory. + * we don't force __pnotify_exit to run here because + * the child is in-consturction and not running yet. + * We don't need a write lock on the subscriber + * list because the child is in construction. + */ + pnotify_unsubscribe(to_task); + up_read(&from_task->pnotify_subscriber_list_sem); + return -ENOMEM; + } + ret = to_subscriber->events->fork(to_task, to_subscriber, + from_subscriber->data); + + if (ret < 0) { + /* Propagates to copy_process as a fork failure. + * Since the child is in consturction, we don't + * need a write lock on the subscriber list. + * __pnotify_exit isn't run because the child + * never got running, exit doesn't make sense. + */ + pnotify_unsubscribe(to_task); + up_read(&from_task->pnotify_subscriber_list_sem); + return ret; /* Fork failure */ + } + else if (ret > 0) { + /* Success, but fork function pointer in the + * pnotify_events structure doesn't want the kernel + * module subscribed. This is an in-construction + * child so we don't need to write lock */ + pnotify_unsubscribe(to_subscriber); + } + } + + /* unlock parent's subscriber list */ + up_read(&from_task->pnotify_subscriber_list_sem); + + return 0; /* success */ +} + +/** + * __pnotify_exit - Remove all subscribers from given task + * @task: Task to remove subscribers from + * + * For each subscriber for the given task, we run the function pointer + * for exit in the associated pnotify_events structure then remove the + * it from the tasks's subscriber list until all subscribers are gone. + * + * Locking: This is a pnotify_subscriber_list writer. This function + * write locks the pnotify_subscriber_list. Callers don't have to do their own + * locking. The pnotify_events structure referenced exit function is called + * with the pnotify_subscriber_list write lock held. + * + */ +void +__pnotify_exit(struct task_struct *task) +{ + struct pnotify_subscriber *subscriber; + struct pnotify_subscriber *subscribertmp; + + /* Remove ref. to subscribers from task immediately */ + down_write(&task->pnotify_subscriber_list_sem); + + list_for_each_entry_safe(subscriber, subscribertmp, + &task->pnotify_subscriber_list, entry) { + subscriber->events->exit(task, subscriber); + pnotify_unsubscribe(subscriber); + } + + up_write(&task->pnotify_subscriber_list_sem); + + return; /* 0 = success, else return last code for failure */ +} + + +/** + * __pnotify_exec - Execute exec callback for each subscriber in this task + * @task: We go through the subscriber list in the given task + * + * Used to when a process that has a subscriber list does an exec. + * The exec pointer in the events structure is optional. + * + * Locking: This is a pnotify_subscriber_list reader and implements the + * read locks itself. Callers don't need to do their own locking. The + * pnotify_events referenced exec function pointer is called in an + * environment where the pnotify_subscriber_list is read locked. + * + */ +int +__pnotify_exec(struct task_struct *task) +{ + struct pnotify_subscriber *subscriber; + + down_read(&task->pnotify_subscriber_list_sem); + + list_for_each_entry(subscriber, &task->pnotify_subscriber_list, entry) { + if (subscriber->events->exec) /* exec funct. ptr is optional */ + subscriber->events->exec(task, subscriber); + } + + up_read(&task->pnotify_subscriber_list_sem); + return 0; +} + + +EXPORT_SYMBOL_GPL(pnotify_get_subscriber); +EXPORT_SYMBOL_GPL(pnotify_subscribe); +EXPORT_SYMBOL_GPL(pnotify_unsubscribe); +EXPORT_SYMBOL_GPL(pnotify_register); +EXPORT_SYMBOL_GPL(pnotify_unregister); Index: linux/include/linux/pnotify.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux/include/linux/pnotify.h 2005-09-30 14:57:57.651642867 -0500 @@ -0,0 +1,227 @@ +/* + * Process Notification (pnotify) interface + * + * + * Copyright (c) 2000-2002, 2004-2005 Silicon Graphics, Inc. + * All Rights Reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * + * Contact information: Silicon Graphics, Inc., 1500 Crittenden Lane, + * Mountain View, CA 94043, or: + * + * http://www.sgi.com + * + * For further information regarding this notice, see: + * + * http://oss.sgi.com/projects/GenInfo/NoticeExplan + */ + +/* + * Data structure definitions and function prototypes used to implement + * process notification (pnotify). + * + * pnotify provides a method (service) for kernel modules to be notified when + * certain events happen in the life of a process. It also provides a + * data pointer that is associated with a given process. See + * Documentation/pnotify.txt for a full description. + */ + +#ifndef _LINUX_PNOTIFY_H +#define _LINUX_PNOTIFY_H + +#include <linux/sched.h> + +#ifdef CONFIG_PNOTIFY + +#define PNOTIFY_NAMELN 32 /* Max chars in PNOTIFY kernel module name */ + +#define PNOTIFY_ERROR -1 /* Error. Fork fail for pnotify_fork */ +#define PNOTIFY_OK 0 /* All is well, stay subscribed */ +#define PNOTIFY_NOSUB 1 /* All is well but don't subscribe module + * to subscriber list for the process */ + + +/** + * INIT_PNOTIFY_LIST - init a pnotify subscriber list struct after declaration + * @_l: Task struct to init the pnotify_module_subscriber_list and semaphore + * + */ +#define INIT_PNOTIFY_LIST(_l) \ +do { \ + INIT_LIST_HEAD(&(_l)->pnotify_subscriber_list); \ + init_rwsem(&(_l)->pnotify_subscriber_list_sem); \ +} while(0) + +/* + * Used by task_struct to manage list of subscriber kernel modules for the + * process. Each pnotify_subscriber provides the link between the process + * and the correct kernel module subscriber. + * + * STRUCT MEMBERS: + * pnotify_events: events: Reference to pnotify_events structure, which + * holds the name key and function pointers. + * data: Opaque data pointer - defined by pnotify kernel modules. + * entry: List pointers + */ +struct pnotify_subscriber { + struct pnotify_events *events; + void *data; + struct list_head entry; +}; + +/* + * Used by pnotify modules to define the callback functions into the + * module. See Documentation/pnotify.txt for details. + * + * STRUCT MEMBERS: + * name: The name of the pnotify container type provided by + * the module. This will be set by the pnotify module. + * fork: Function pointer to function used when associating + * a forked process with a kernel module referenced by + * this struct. pnotify.txt will provide details on + * special return codes interpreted by pnotify. + * + * exit: Function pointer to function used when a process + * associated with the kernel module owning this struct + * exits. + * + * init: Function pointer to initialization function. This + * function is used when the module registers with pnotify + * to associate existing processes with the referring + * kernel module. This is optional and may be set to NULL + * if it is not needed by the pnotify kernel module. + * + * Note: The return values are managed the same way as in + * attach above. Except, of course, an error doesn't + * result in a fork failure. + * + * Note: The implementation of pnotify_register causes + * us to evaluate some tasks more than once in some cases. + * See the comments in pnotify_register for why. + * Therefore, if the init function pointer returns + * PNOTIFY_NOSUB, which means that it doesn't want this + * process associated with the kernel module, that init + * function must be prepared to possibly look at the same + * "skipped" task more than once. + * + * data: Opaque data pointer - defined by pnotify modules. + * module: Pointer to kernel module struct. Used to increment & + * decrement the use count for the module. + * entry: List pointers + * exec: Function pointer to function used when a process + * this kernel module is subscribed to execs. This + * is optional and may be set to NULL if it is not + * needed by the pnotify module. + * refcnt: Keep track of user count of pnotify_events + */ +struct pnotify_events { + struct module *module; + char *name; /* Name Key - restricted to 32 chars */ + void *data; /* Opaque module specific data */ + struct list_head entry; /* List pointers */ + atomic_t refcnt; /* usage counter */ + int (*init)(struct task_struct *, struct pnotify_subscriber *); + int (*fork)(struct task_struct *, struct pnotify_subscriber *, void*); + void (*exit)(struct task_struct *, struct pnotify_subscriber *); + void (*exec)(struct task_struct *, struct pnotify_subscriber *); +}; + + +/* Kernel service functions for providing pnotify support */ +extern struct pnotify_subscriber *pnotify_get_subscriber(struct task_struct + *task, char *key); +extern struct pnotify_subscriber *pnotify_subscribe(struct task_struct *task, + struct pnotify_events *pt); +extern void pnotify_unsubscribe(struct pnotify_subscriber *subscriber); +extern int pnotify_register(struct pnotify_events *pt_new); +extern int pnotify_unregister(struct pnotify_events *pt_old); +extern int __pnotify_fork(struct task_struct *to_task, + struct task_struct *from_task); +extern void __pnotify_exit(struct task_struct *task); +extern int __pnotify_exec(struct task_struct *task); + +/** + * pnotify_fork - child inherits subscriber list associations of its parent + * @child: child task - to inherit + * @parent: parenet task - child inherits subscriber list from this parent + * + * function used when a child process must inherit subscriber list assocation + * from the parent. Return code is propagated as a fork fail. + * + */ +static inline int pnotify_fork(struct task_struct *child, + struct task_struct *parent) +{ + INIT_PNOTIFY_LIST(child); + if (!list_empty(&parent->pnotify_subscriber_list)) + return __pnotify_fork(child, parent); + + return 0; +} + + +/** + * pnotify_exit - Detach subscriber kernel modules from this process + * @task: The task the subscribers will be detached from + * + */ +static inline void pnotify_exit(struct task_struct *task) +{ + if (!list_empty(&task->pnotify_subscriber_list)) + __pnotify_exit(task); +} + +/** + * pnotify_exec - Used when a process exec's + * @task: The process doing the exec + * + */ +static inline void pnotify_exec(struct task_struct *task) +{ + if (!list_empty(&task->pnotify_subscriber_list)) + __pnotify_exec(task); +} + +/** + * INIT_TASK_PNOTIFY - Used in INIT_TASK to set head and sem of subscriber list + * @tsk: The task work with + * + * Marco Used in INIT_TASK to set the head and sem of pnotify_subscriber_list + * If CONFIG_PNOTIFY is off, it is defined as an empty macro below. + * + */ +#define INIT_TASK_PNOTIFY(tsk) \ + .pnotify_subscriber_list = LIST_HEAD_INIT(tsk.pnotify_subscriber_list),\ + .pnotify_subscriber_list_sem = \ + __RWSEM_INITIALIZER(tsk.pnotify_subscriber_list_sem), + +#else /* CONFIG_PNOTIFY */ + +/* + * Replacement macros used when pnotify (Process Notification) support is not + * compiled into the kernel. + */ +#define INIT_TASK_PNOTIFY(tsk) +#define INIT_PNOTIFY_LIST(l) do { } while(0) +#define pnotify_fork(ct, pt) ({ 0; }) +#define pnotify_exit(t) do { } while(0) +#define pnotify_exec(t) do { } while(0) +#define pnotify_unsubscribe(t) do { } while(0) + +#endif /* CONFIG_PNOTIFY */ + +#endif /* _LINUX_NOTIFY_H */ Index: linux/Documentation/pnotify.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux/Documentation/pnotify.txt 2005-09-30 14:57:57.655548722 -0500 @@ -0,0 +1,368 @@ +Process Notification (pnotify) +-------------------- +pnotify provides a method (service) for kernel modules to be notified when +certain events happen in the life of a process. Events we support include +fork, exit, and exec. A special init event is also supported (see events +below). More events could be added. pnotify also provides a generic data +pointer for the modules to work with so that data can be associated per +process. + +A kernel module will register (pnotify_register) a service request describing +events it cares about (pnotify_events) with pnotify_register. The request +tells pnotify which notifications the kernel module wants. The kernel module +passes along function pointers to be called for these events (exit, fork, exec) +in the pnotify_events service request. + +From the process point of view, each process has a kernel module subscriber +list (pnotify_subscriber_list). These kernel modules are the ones who want +notification about the life of the process. As described above, each kernel +module subscriber on the list has a generic data pointer to point to data +associated with the process. + +In the case of fork, pnotify will allocate the same kernel module subscriber +list for the new child that existed for the parent. The kernel module's +function pointer for fork is also called for the child being constructed so +the kernel module can do what ever it needs to do when a parent forks this +child. Special return values apply for the fork and init event that don't to +others. They are described in the fork and init example below. + +For exit, similar things happen but the exit function pointer for each +kernel module subscriber is called and the kernel module subscriber entry for +that process is deleted. + + +Events +------ +Events are stages of a processes life that kernel modules care about. The +fork event is triggered in a certain location in copy_process when a parent +forks. The exit event happens when a process is going away. We also support +an exec event, which happens when a process execs. Finally, there is an init +event. This special event makes it so this kernel module will be associated +with all current processes in the system at the time of registration. This is +used when a kernel module wants to keep track of all current processes as +opposed to just those it associates by itself (and children that follow). The +events a kernel module cares about are set up in the pnotify_events +structure - see usage below. + +When setting up a pnotify_events, you designate which events you care about +by either associating NULL (meaning you don't care about that event) or a +pointer to the function to run when the event is triggered. The fork event +and the exit event is currently required. + + +How do processes become associated with kernel modules? +------------------------------------------------------- +Your kernel module itself can use the pnotify_subscribe function to associate +a given process with a given pnotify_events structure. This adds +your kernel module to the subscriber list of the process. In the case +of inescapable job containers making use of PAM, when PAM allows a person to +log in, PAM contacts job (via a PAM job module which uses the job userland +library) and the kernel Job code will call pnotify_subscribe to associate the +process with pnotify. From that point on, the kernel module will be notified +about events in the process's life that the module cares about (as well, +as any children that process may later have). + +Likewise, your kernel module can remove an association between it and +a given process by using pnotify_unsubscribe. + + +Example Usage +------------- + +=== filling out the pnotify_events structure === + +A kernel module wishing to use pnotify needs to set up a pnotify_events +structure. This structure tells pnotify which events you care about and what +functions to call when those events are triggered. In addition, you supply a +name (usually the kernel module name). The entry is always filled out as +shown below. .module is usually set to THIS_MODULE. data can be optionally +used to store a pointer with the pnotify_events structure. + +Example of a filled out pnotify_events: + +static struct pnotify_events pnotify_events = { + .module = THIS_MODULE, + .name = "test_module", + .data = NULL, + .entry = LIST_HEAD_INIT(pnotify_events.entry), + .init = test_init, + .fork = test_attach, + .exit = test_detach, + .exec = test_exec, +}; + +The above pnotify_events structure says the kernel module "test_module" cares +about events fork, exit, exec, and init. In fork, call the kernel module's +test_attach function. In exec, call test_exec. In exit, call test_detach. +The init event is specified, so all processes on the system will be associated +with this kernel module during registration and the test_init function will +be run for each. + + +=== Registering with pnotify === + +You will likely register with pnotify in your kernel module's module_init +function. Here is an example: + +static int __init test_module_init(void) +{ + int rc = pnotify_register(&pnotify_events); + if (rc < 0) { + return -1; + } + + return 0; +} + + +=== Example init event function ==== + +Since the init event is defined, it means this kernel module is added +to the subscriber list of all processes -- it will receive notification +about events it cares about for all processes and all children that +follow. + +Of course, if a kernel module doesn't need to know about all current +processes, that module shouldn't implement this and '.init' in the +pnotify_events structure would be NULL. + +This is as opposed to the normal method where the kernel module adds itself +to the subscriber list of a process using pnotify_subscribe. + +Important: +Note: The implementation of pnotify_register causes us to evaluate some tasks +more than once in some cases. See the comments in pnotify_register for why. +Therefore, if the init function pointer returns PNOTIFY_NOSUB, which means +that it doesn't want a process association, that init function must be +prepared to possibly look at the same "skipped" task more than once. + +Note that the return value here is similar to the fork function pointer +below except there is no notion of failing the fork since existing processes +aren't forking. + +PNOTIFY_OK - good, adds the kernel module to the subscriber list for process +PNOTIFY_NOSUB - good, but don't add kernel module to subscriber list for process + +static int test_init(struct task_struct *tsk, struct pnotify_subscriber *subscriber) +{ + if (pnotify_get_subscriber(tsk, "test_module") == NULL) + dprintk("ERROR pnotify expected \"%s\" PID = %d\n", "test_module", tsk->pid); + + dprintk("FYI pnotify init hook fired for PID = %d\n", tsk->pid); + atomic_inc(&init_count); + return 0; +} + + +=== Example fork (test_attach) function === + +This function is executed when a process forks - this is associated +with the pnotify_callout callout in copy_process. There would be a very +similar test_detach function (not shown). + +pnotify will add the kernel module to the notification list for the child +process automatically and then execute this fork function pointer (test_attach +in this example). However, the kernel module can control whether the kernel +module stays on the process's subscriber list and wants notification by the +return value. + +PNOTIFY_ERROR - prevent the process from continuing - failing the fork +PNOTIFY_OK - good, adds the kernel module to the subscriber list for process +PNOTIFY_NOSUB - good, but don't add kernel module to subscriber list for process + + +static int test_attach(struct task_struct *tsk, struct pnotify_subscriber *subscriber, void *vp) +{ + dprintk("pnotify attach hook fired for PID = %d\n", tsk->pid); + atomic_inc(&attach_count); + + return PNOTIFY_OK; +} + + +=== Example exec event function === + +And here is an example function to run when a task gets to exec. So any +time a "tracked" process gets to exec, this would execute. + +static void test_exec(struct task_struct *tsk, struct pnotify_subscriber *subscriber) +{ + dprintk("pnotify exec hook fired for PID %d\n", tsk->pid); + atomic_inc(&exec_count); +} + + +=== Unregistering with pnotify === + +You will likely wish to unregister with pnotify in the kernel module's +module_exit function. Here is an example: + +static void __exit test_module_cleanup(void) +{ + pnotify_unregister(&pnotify_events); + printk("detach called %d times...\n", atomic_read(&detach_count)); + printk("attach called %d times...\n", atomic_read(&attach_count)); + printk("init called %d times...\n", atomic_read(&init_count)); + printk("exec called %d times ...\n", atomic_read(&exec_count)); + if (atomic_read(&attach_count) + atomic_read(&init_count) != + atomic_read(&detach_count)) + printk("pnotify PROBLEM: attach count + init count SHOULD equal detach cound and doesn't\n"); + else + printk("Good - attach count + init count equals detach count.\n"); +} + + + +=== Actually using data associated with the process in your module === + +The above examples show you how to create an example kernel module using +pnotify, but they didn't show what you might do with the data pointer +associated with a given process. Below, find an example of accessing +the data pointer for a given process from within a kernel module making use +of pnotify. + +pnotify_get_subscriber is used to retrieve the pnotify subscriber for a given +process and kernel module. Like this: + +subscriber = pnotify_get_subscriber(task, name); + +Where name is your kernel module's name (as provided in the pnotify_events +structure) and task is the process you're interested +in. + +Please be careful about locking. The task structure has a +pnotify_subscriber_list_sem to be used for locking. This example retrieves +a given task in a way that ensures it doesn't disappear while we try to +access it (that's why we do locking for the tasklist_lock and task). The +pnotify subscriber list is locked to ensure the list doesn't change as we +search it with pnotify_get_subscriber. + + read_lock(&tasklist_lock); + get_task_struct(task); /* Ensure the task doesn't vanish on us */ + read_unlock(&tasklist_lock); /* Unlock the tasklist */ + down_read(&task->pnotify_subscriber_list_sem); /* readlock subscriber list */ + + subscriber = pnotify_get_subscriber(task, name); + if (subscriber) { + /* Get the widgitId associated with this task */ + widgitId = ((widgitId_t *)subscriber->data); + } + put_task_struct(task); /* Done accessing the task */ + up_read(&task->pnotify_subscriber_list_sem); /* unlock subscriber list */ + + +Future Events +------------- +Kingsley Cheung suggested that we add events for uid and gid changes and this +may inspire broader use. Depending on how the discussoin goes, I'll post a +patch to add this functionality in the next day or two. + +History +------- +Process Notification used to be known as PAGG (Process Aggregates). +It was re-written to be called Process Notification because we believe this +better describes its purpose. Structures and functions were re-named to +be more clear and to reflect the new name. + + +Why Not Notifier Lists? +----------------------- +We investigated the use of notifier lists, available in newer kernels. + +Notifier lists would not be as efficient as pnotify for kernel modules +wishing to associate data with processes. With pnotify, if the +pnotify_subscriber_list of a given task is NULL, we can instantly know +there are no kernel modules that care about the process. Further, the +callbacks happen in places were the task struct is likely to be cached. +So this is a quick operation. With notifier lists, the scope is system +wide rather than per process. As long as one kernel module wants to be +notified, we have to walk the notifier list and potentially waste cycles. +In the case of pnotify, we only walk lists if we're interested about +a specific task. + +On a system where pnotify is used to track only a few processes, the +overhead of walking the notifier list is high compared to the overhead +of walking the kernel module subscriber list only when a kernel module +is interested in a given process. + +I don't believe this is easily solved in notifier lists themselves as +they are meant to be global resources, not per-task resources. + +Overlooking performance issues, notifier lists in and of themselves wouldn't +solve the problem pnotify solves anyway. Although you could argue notifier +lists can implement the callback portion of pnotify, there is no association +of data with a given process. This is a needed for kernel modules to +efficiently associate a task with a data pointer without cluttering up +the task struct. + +In addition to data associated with a process, we desire the ability for +kernel modules to add themselves to the subscriber list for any arbitrary +process - not just current or a child of current. + + +Some Justification +------------------ +We feel that pnotify could be used to reduce the size of the task struct or +the number of functions in copy_process. For example, if another part of the +kernel needs to know when a process is forking or exiting, they could use +pnotify instead of adding additional code to task struct, copy_process, or +exit. + +Some have argued that PAGG in the past shouldn't be used because it will +allow interesting things to be implemented outside of the kernel. While this +might be a small risk, having these in place allows customers and users to +implement kernel components that you don't want to see in the kernel anyway. + +For example, a certain vendor may have an urgent need to implement kernel +functionality or special types of accounting that nobody else is interested +in. That doesn't mean the code isn't open-source, it just means it isn't +applicable to all of Linux because it satisfies a niche. + +All of pnotify's functionality that needs to be exported is exported with +EXPORT_SYMBOL_GPL to discourage abuse. + +The risk already exists in the kernel for people to implement modules outside +the kernel that suffer from less peer review and possibly bad programming +practice. pnotify could add more oppurtunities for out-of-tree kernel module +authors to make new modules. I believe this is somewhat mitigated by the +already-existing 'tainted' warnings in the kernel. + +Other Ideas? +------------ +There have been similar proposals to provide pieces of the pnotify +functionality. If there is a better proposal out there, let's explore it. +Here are some key functions I hope to see in any proposal: + + - Ability to have notification for exec, fork, exit at minimum + - Ability to extend to other callouts later (such as uid/gid changes as + I described earlier) + - Ability for pnotify user modules to implement code that ends up adding + a kernel module subscriber to any arbitrary process (not just current and + its children). + +I believe, if the above are more or less met, we should be in good shape for +our other open source projects such as linux job. + +Variable Name Changes from PAGG to pnotify +------------------------------------------ +PAGG_NAMELEN -> PNOTIFY_NAMELEN +struct pagg -> pnotify_subscriber +pagg_get -> pnotify_get_subscriber +pagg_alloc -> pnotify_subscribe +pagg_free -> pnotify_unsubscribe +pagg_hook_register -> pnotify_register +pagg_hook_unregister -> pnotify_unregister +pagg_attach -> pnotify_fork +pagg_detach -> pnotify_exit +pagg_exec -> pnotify_exec +struct pagg_hook -> pnotify_events + +With pnotify_events (formerly pagg_hook): + attach -> fork + detach -> exit + +Return codes for the init and fork function pointers should use: +PNOTIFY_ERROR - prevent the process from continuing - failing the fork +PNOTIFY_OK - good, adds the kernel module to the subscriber list for process +PNOTIFY_NOSUB - good, but don't add kernel module to subscriber list for process + -- Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota |
From: Erik J. <er...@sg...> - 2005-09-22 21:18:38
|
> Is making pnotify leaner an option for PAGG ? something as simple as > what is in the patch http://marc.theaimsgroup.com/?l=linux- > kernel&m=111532025203086&w=2 > > Note that we do not need all the events listed there anymore, only fork > and exit. I just took a look at this patch. Forgive me if I missed something. I find I have to use things like this for real before I catch every detail and I didn't go that far with this before writing. One thing about the pnotify patch is there are lots of comments, could that be making it seem larger than it really is? There are a couple added things in pnotify that make it a bit different. First, of course, it's more generic. More importantly, pnotify has a task-associated data pointer. This allows subscriber kernel modules to quickly locate data associated with the task. Without this, some pnotify users (including Job as an example, but others as well) would need to implement their own hash table lookup mechanism to associate task data with a given task. pnotify does have subscriber inheritance built in at the fork callout (children are by default given the same subscriber list as the parent). This could be implemented by each subscriber kernel module that needs it instead I suppose. One other thing pnotify has is a way to associate all tasks currently running with a given kernel module subscriber at pnotify registration time. This is a feature I think is very useful. I suppose this could be done in the subscriber kernel module itself for modules that wish to use that functionality - so that might reduce the patch size some if people aren't interested. Finally, for module subscribers I'm familiar with, I need a way to associate a kernel module with any possible task (not necessarily just current or in-construction children). This is because sometimes kernel modules need to be notified about a task that isn't the the current task at the time. In the case of Job, perhaps batch scheduler or similar wants to track a certain process separately from the others for some administrative reason. If I understand your patch right, we go in to searching for callouts for each task, even if nothing cares about a specific task. This would still be quick I suppose if nobody registered for events. However, the pnotify solution is per-task, not global. If the task's subscriber list is null, we're done (except, for fork, we do have to set up the the list head and semaphore for new tasks). ie: +static inline int pnotify_fork(struct task_struct *child, + struct task_struct *parent) +{ + INIT_PNOTIFY_LIST(child); + if (!list_empty(&parent->pnotify_subscriber_list)) + return __pnotify_fork(child, parent); + + return 0; In other words, to be quick with the ckrm patch, there has to be no registered events. For pnotify, a kernel module subscriber may not care about all tasks. The performance hit is reduced because we don't take it globally when someone is registered - we take it per-task when a kernel module subscriber is registered to a given task. The size of the subscriber list per task may be different per task too. For example, one task may be part of a Linux job and be part of an Array session. Another task may just be part of a Linux job and nothing else. Finally, there could be tasks that have no subscriber list at all. I think the above reasons explain why pnotify is a bit heavier than this ckrm patch. I think the key pieces I'd hope for in an implementation are: * A means to subscribe arbitrary kernel modules to any given task (not just the current task or child-in-construction at the moment). * A task-associated data pointer that points to task associated data a kernel module subscriber cares about * notification that is task based - so if the kernel module subscriber list is null or small, we don't take a performance hit. * events fork, exec, exit I think the rest of what is in pnotify is very useful. But if it came to it and the community wanted to chop bits out besides the above, it would at least still be possible to do what is needed efficiently. The penalty would be some code duplication in subscriber modules and increased code size in subscriber modules in some cases. Erik |
From: Chandra S. <sek...@us...> - 2005-09-23 02:12:36
|
On Thu, 2005-09-22 at 16:18 -0500, Erik Jacobson wrote: > > Is making pnotify leaner an option for PAGG ? something as simple as > > what is in the patch http://marc.theaimsgroup.com/?l=linux- > > kernel&m=111532025203086&w=2 > > > > Note that we do not need all the events listed there anymore, only fork > > and exit. Erik, Thanks for your time and effort. > > I just took a look at this patch. Forgive me if I missed something. I > find I have to use things like this for real before I catch every detail > and I didn't go that far with this before writing. This is what it does(your analysis is right): - has a list of independent events(like fork, exit etc.,) that you can register for - one can register a callback for any particular event. - callback is called when _any_ process goes thru that event. - callback information is not stored in the task data structure. > > > One thing about the pnotify patch is there are lots of comments, could that > be making it seem larger than it really is? :). What i meant by heavy weight is not the code size, but what is being done in the whole process. BTW, your documentation is useful. My comments were in the direction of getting the least common feature set that will be used/needed by you/CKRM/others. There are some things in your implementation that may not be needed by others. for example, CKRM does not need propagating the events data structure from parent to child. It also doesn't have set of tasks that it cares about, it cares about all tasks. What we need is event notification for all tasks in the system. I do not know whether CHOS, which Shane described about, needs just the notification of event or what pnotify provides. > > There are a couple added things in pnotify that make it a bit different. > First, of course, it's more generic. More importantly, pnotify has a > task-associated data pointer. This allows subscriber kernel modules to > quickly locate data associated with the task. Without this, some pnotify > users (including Job as an example, but others as well) would need to > implement their own hash table lookup mechanism to associate task data > with a given task. > Yes, this helps if you have only one field one's module wants to associate with the task. If you have more than one, you have a define a new data structure, which might mean an additional level of indirection (we have gotten review comments regarding these multiple levels of indirections in hot path like fork). But, i agree that it reduces clutter in the task struct. Also, not everybody that would use this feature would need room for data. > pnotify does have subscriber inheritance built in at the fork callout > (children are by default given the same subscriber list as the parent). > This could be implemented by each subscriber kernel module that needs > it instead I suppose. Yes, that would help. > > One other thing pnotify has is a way to associate all tasks currently running > with a given kernel module subscriber at pnotify registration time. This > is a feature I think is very useful. I suppose this could be done in the > subscriber kernel module itself for modules that wish to use that > functionality - so that might reduce the patch size some if people aren't > interested. Yes. > > Finally, for module subscribers I'm familiar with, I need a way to associate > a kernel module with any possible task (not necessarily just current or > in-construction children). This is because sometimes kernel modules need > to be notified about a task that isn't the the current task at the time. In > the case of Job, perhaps batch scheduler or similar wants to track a certain > process separately from the others for some administrative reason. I do not understand this requirement. > > If I understand your patch right, we go in to searching for callouts > for each task, even if nothing cares about a specific task. This would > still be quick I suppose if nobody registered for events. However, the yes, you are right. That is because CKRM manages all tasks not a set of tasks as Linux job. > pnotify solution is per-task, not global. If the task's subscriber list > is null, we're done (except, for fork, we do have to set up the the list > head and semaphore for new tasks). ie: > > +static inline int pnotify_fork(struct task_struct *child, > + struct task_struct *parent) > +{ > + INIT_PNOTIFY_LIST(child); > + if (!list_empty(&parent->pnotify_subscriber_list)) > + return __pnotify_fork(child, parent); > + > + return 0; > > In other words, to be quick with the ckrm patch, there has to be no > registered events. For pnotify, a kernel module subscriber may not care > about all tasks. The performance hit is reduced because we don't take it > globally when someone is registered - we take it per-task when a > kernel module subscriber is registered to a given task. The size of the > subscriber list per task may be different per task too. For example, one > task may be part of a Linux job and be part of an Array session. Another > task may just be part of a Linux job and nothing else. Finally, there could > be tasks that have no subscriber list at all. > yes, that is different for CKRM. > I think the above reasons explain why pnotify is a bit heavier than i totally agree. > this ckrm patch. I think the key pieces I'd hope for in an implementation > are: > > * A means to subscribe arbitrary kernel modules to any given task (not > just the current task or child-in-construction at the moment). > * A task-associated data pointer that points to task associated data a > kernel module subscriber cares about > * notification that is task based - so if the kernel module subscriber > list is null or small, we don't take a performance hit. > * events fork, exec, exit let me put our laundry list :) * events fork and exit * global callback function called for every task on the above events. * 2 fields in the task data structure (We could use the data pointer as defined in your data structure, but as i pointed i am concerned about multiple indirections). (Shailabh, do you see anything else that i overlooked ?) > > I think the rest of what is in pnotify is very useful. But if it came to it > and the community wanted to chop bits out besides the above, it would at > least still be possible to do what is needed efficiently. The penalty would > be some code duplication in subscriber modules and increased code size in > subscriber modules in some cases. > > Erik > -- ---------------------------------------------------------------------- Chandra Seetharaman | Be careful what you choose.... - sek...@us... | .......you may get it. ---------------------------------------------------------------------- |
From: Jay L. <jl...@en...> - 2005-09-29 20:04:35
|
Jay Lan wrote: >Jay Lan wrote: > >>Andrew Morton wrote: >> >>> Yes, I don't think earlier versions of cbus/connector were wholly >>>race-free. Testing would need to be redone on the in-kernel >>>version. If >>>netlink itself is not doing internal hard-coded-GFP_ATOMIC allocations I >>>_think_ the whole thing should be reliable. >>> >>>If not, we need to work out where the messages went... >>> >>Agreed. I will rerun my test later tis week. >> >Guillaume, > >I could not find your fork connector patch in fork.c in 2.6.14-rc2 and >2.6.14-rc2-mm1? >My test assumes your stuff being there... > Mm, i saw your updates on ELSA web page... Too bad your patch was removed from -mm tree. Your new "fork_advisor" does not use connector, so my testing will not make any sense at all, because i like to demonistrate that a connector is not reliable... Maybe we should join force rooting for Erik now :) - jay BTW, can you email me your last fork_connector patch? Thanks! >And my fclisten failed in bind... Something must have changed... >I will sort it out. > >Thanks, > - jay > > >>Thanks, >> - jay >> >> > > >------------------------------------------------------- >This SF.Net email is sponsored by: >Power Architecture Resource Center: Free content, downloads, discussions, >and more. http://solutions.newsforge.com/ibmarch.tmpl >_______________________________________________ >Lse-tech mailing list >Lse...@li... >https://lists.sourceforge.net/lists/listinfo/lse-tech > |
From: Guillaume T. <gui...@bu...> - 2005-09-30 06:38:19
|
On Thu, 29 Sep 2005 13:03:59 -0700 Jay Lan <jl...@en...> wrote: > Mm, i saw your updates on ELSA web page... > Too bad your patch was removed from -mm tree. > > Your new "fork_advisor" does not use connector, so my testing will not > make any sense at all, because i like to demonistrate that a connector > is not > reliable... The new "fork_advisor" is just a temporary solution. In future version I will use the "Process Events Connector" proposed by Matthew Helsley that are based on connector. > BTW, can you email me your last fork_connector patch? Yes I will do that. Best regards, Guillaume |
From: Matt H. <mat...@us...> - 2005-10-02 22:41:12
|
On Fri, 2005-09-30 at 08:38 +0200, Guillaume Thouvenin wrote: > On Thu, 29 Sep 2005 13:03:59 -0700 > Jay Lan <jl...@en...> wrote: > > > Mm, i saw your updates on ELSA web page... > > Too bad your patch was removed from -mm tree. > > > > Your new "fork_advisor" does not use connector, so my testing will not > > make any sense at all, because i like to demonistrate that a connector > > is not > > reliable... > > The new "fork_advisor" is just a temporary solution. In future version I > will use the "Process Events Connector" proposed by Matthew Helsley that > are based on connector. > > > BTW, can you email me your last fork_connector patch? > > Yes I will do that. > > Best regards, > Guillaume Jay, If you're interested in the "Process Events Connector" patch I sent out the RFC patch for it to LKML very recently: http://marc.theaimsgroup.com/?l=linux-kernel&m=112796215321612&w=2 If you have any questions or feedback please don't hesitate to send me an email. Cheers, -Matt Helsley |
From: Jack S. <st...@sg...> - 2005-10-07 14:26:23
|
Here is an alternate proposal for a PAGG / pnotifier mechanism. This mechanism uses a new "task notifier" that is task specific. Unlike some of the other callout mechanisms, callouts are attached to specific tasks. There is no global callout list. Task notifiers are optional callouts that can be registered on a per-task basic. The task_notifier mechanism does not provide all of the capabilities of PAGG. Specifically, a task cannot register a task_notifier to an arbitrary task. Registration is supported only against 1) the current task, or 2) a child task during a clone. There is no support for adding notifiers to an arbitrary task. Interesting points: - no locks - no global data maintained by the notifier mechanism - 1 void* pointer added to task struct - arbitrary number of callouts - support for callout priority - each callout can have private data (ex., embed task_notifier in larger structure) but general callout mechanism is unaware of the data Loadable modules need to reference-count being in a notifier list & prevent unloading if the reference-count is non-zero. Modules that use task_notifiers are not visible to anything in the kernel if no task_notifiers are currently registered. We have a couple of modules that use the current PAGG callout mechanism. I converted one of these users (dplace) to use the new mechanism. The conversion effort was trivial. We have additional modules that use PAGG. These have not yet been analyzed to see if task_notifiers are sufficient but it looks promising. I'm proposing this patch to see if other users of pagg / pnotify could use this mechanism instead. The mechanism is lightweight, small & unintrusive. Does this look like something that would be of general use. Here is a trial patch against a SUSE kernel that also has the PAGG patch applied. I added callouts in fork(), exec(), & exit() at the same point that PAGG added callouts. Task_notifier callouts can be added in other places as needed. fs/exec.c | 2 include/linux/init_task.h | 1 include/linux/sched.h | 2 include/linux/task_notifier.h | 56 ++++++++++++++++++++ kernel/Makefile | 2 kernel/exit.c | 2 kernel/fork.c | 4 + kernel/task_notifier.c | 113 ++++++++++++++++++++++++++++++++++++++++++ Index: linux/include/linux/task_notifier.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux/include/linux/task_notifier.h 2005-10-05 13:06:33.994660706 -0500 @@ -0,0 +1,55 @@ +/* + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file "COPYING" in the main directory of this archive + * for more details. + * + * Copyright (C) 2005 Silicon Graphics, Inc. All rights reserved. + * + * + * Routines to manage task notifier chains for passing task state changes to any + * interested routines. + * + */ + +#ifndef _LINUX_TASK_NOTIFIER_H +#define _LINUX_TASK_NOTIFIER_H +#include <linux/compiler.h> +#include <linux/types.h> +#include <linux/sched.h> +#include <asm/current.h> + +struct task_notifier_block; + +typedef void (task_notifier_func)(unsigned int, void *, struct task_notifier_block *nb); + +struct task_notifier_block +{ + task_notifier_func *task_notifier_func; + struct task_notifier_block *next; + int priority; + unsigned int id_select; +}; + + +extern void task_notifier_chain_register(struct task_notifier_block *nb); +extern void task_notifier_chain_register_child(struct task_notifier_block *nb, struct task_struct *p); +extern void task_notifier_chain_unregister(struct task_notifier_block *nb); +extern struct task_notifier_block *find_task_notifier_block_proc(task_notifier_func *func); +extern void _task_notifier_call_chain(unsigned int id, void *arg); + +static inline void task_notifier_call_chain(unsigned int id, void *arg) +{ + if (unlikely(current->task_notifier)) + _task_notifier_call_chain(id, arg); +} + + +/* + * Notifier identifiers - used as bitmask for selections + */ + +#define TN_FORK 0x00000001 +#define TN_EXEC 0x00000002 +#define TN_EXIT 0x00000004 + +#endif /* _LINUX_TASK_NOTIFIER_H */ Index: linux/kernel/task_notifier.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux/kernel/task_notifier.c 2005-10-05 13:06:52.217418263 -0500 @@ -0,0 +1,111 @@ +/* + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file "COPYING" in the main directory of this archive + * for more details. + * + * Copyright (C) 2005 Silicon Graphics, Inc. All rights reserved. + */ + +#include <linux/config.h> +#include <linux/module.h> +#include <linux/task_notifier.h> +#include <asm/current.h> + +/** + * task_notifier_chain_register_task - Add a task_notifier to a task. + * @n: Entry to be added to the task notifier chain + * @p: Task (must be current or a child being created during fork/clone + */ +static inline void task_notifier_chain_register_task(struct task_notifier_block + *nb, struct task_struct *p) +{ + struct task_notifier_block **list = + (struct task_notifier_block **)&p->task_notifier; + + while (*list && nb->priority > (*list)->priority) + list = &((*list)->next); + nb->next = *list; + *list = nb; +} + +/** + * task_notifier_chain_register_child - Add a task_notifier to a task during fork/clone. + * @n: Entry to be added to the task notifier chain + * @p: Task (must be current or a child being created during fork/clone + */ +void task_notifier_chain_register_child(struct task_notifier_block *nb, + struct task_struct *p) +{ + task_notifier_chain_register_task(nb, p); +} + +/** + * task_notifier_chain_register - Add a task_notifier to the current task + * @n: Entry to be added to the task notifier chain + */ +void task_notifier_chain_register(struct task_notifier_block *nb) +{ + task_notifier_chain_register_task(nb, current); +} + +/** + * task_notifier_chain_unregister - Remove notifier from the current task + * @nb: Entry in notifier chain + * + */ + +void task_notifier_chain_unregister(struct task_notifier_block *nb) +{ + struct task_notifier_block **list = + (struct task_notifier_block **)¤t->task_notifier; + + while (*list && *list != nb) + list = &((*list)->next); + + if (*list == nb) + *list = nb->next; +} + +/** + * find_task_notifier_block_proc - Scan the current task's notifiers for a + * task_notifier_block with a pointer to @func + * @func: Function + * + */ + +struct task_notifier_block *find_task_notifier_block_proc(task_notifier_func * + func) +{ + struct task_notifier_block *nb = current->task_notifier; + + while (nb && nb->task_notifier_func != func) + nb = nb->next; + + return nb; +} + +/** + * task_notifier_call_chain - Call functions in a notifier chain + * @id: Callout identifier + * @arg: Argument to pass to called fumctions + * + * Calls each function in a notifier chain in turn. + * + */ + +void _task_notifier_call_chain(unsigned int id, void *arg) +{ + struct task_notifier_block *nb, *next_nb = current->task_notifier; + + while ((nb = next_nb)) { + next_nb = nb->next; + if (nb->id_select & id) + nb->task_notifier_func(id, arg, nb); + } +} + +EXPORT_SYMBOL_GPL(task_notifier_chain_register); +EXPORT_SYMBOL_GPL(task_notifier_chain_register_child); +EXPORT_SYMBOL_GPL(task_notifier_chain_unregister); +EXPORT_SYMBOL_GPL(find_task_notifier_block_proc); +EXPORT_SYMBOL_GPL(task_notifier_call_chain); Index: linux/fs/exec.c =================================================================== --- linux.orig/fs/exec.c 2005-09-17 09:25:56.990242412 -0500 +++ linux/fs/exec.c 2005-10-05 13:11:29.621874069 -0500 @@ -51,6 +51,7 @@ #include <linux/audit.h> #include <linux/trigevent_hooks.h> #include <linux/pagg.h> +#include <linux/task_notifier.h> #include <linux/acct.h> #include <asm/uaccess.h> @@ -1209,6 +1210,7 @@ int do_execve(char * filename, free_arg_pages(&bprm); pagg_exec(current); + task_notifier_call_chain(TN_EXEC, NULL); /* should file or bprm be passed??? */ /* execve success */ security_bprm_free(&bprm); Index: linux/include/linux/init_task.h =================================================================== --- linux.orig/include/linux/init_task.h 2005-09-17 09:25:22.922404438 -0500 +++ linux/include/linux/init_task.h 2005-09-17 09:45:28.637217785 -0500 @@ -115,6 +115,7 @@ extern struct group_info init_groups; .journal_info = NULL, \ .map_base = __TASK_UNMAPPED_BASE, \ .io_wait = NULL, \ + .task_notifier = NULL, \ INIT_TASK_PAGG(tsk) \ } Index: linux/include/linux/sched.h =================================================================== --- linux.orig/include/linux/sched.h 2005-09-17 09:25:56.999030585 -0500 +++ linux/include/linux/sched.h 2005-09-17 10:02:23.314663901 -0500 @@ -601,7 +601,7 @@ struct task_struct { struct list_head pagg_list; struct rw_semaphore pagg_sem; #endif - + void *task_notifier; }; static inline pid_t process_group(struct task_struct *tsk) Index: linux/kernel/exit.c =================================================================== --- linux.orig/kernel/exit.c 2005-09-17 09:25:57.001959975 -0500 +++ linux/kernel/exit.c 2005-09-17 09:59:17.263210664 -0500 @@ -31,6 +31,7 @@ #include <linux/trigevent_hooks.h> #include <linux/ltt.h> #include <linux/pagg.h> +#include <linux/task_notifier.h> #include <asm/uaccess.h> #include <asm/pgtable.h> @@ -863,6 +864,7 @@ asmlinkage NORET_TYPE void do_exit(long tsk->exit_code = code; pagg_detach(tsk); + task_notifier_call_chain(TN_EXIT, NULL); exit_notify(tsk); BUG_ON(!(current->flags & PF_DEAD)); schedule(); Index: linux/kernel/fork.c =================================================================== --- linux.orig/kernel/fork.c 2005-09-17 09:25:57.884683056 -0500 +++ linux/kernel/fork.c 2005-10-07 09:15:25.285049535 -0500 @@ -37,6 +37,7 @@ #include <linux/trigevent_hooks.h> #include <linux/cpuset.h> #include <linux/pagg.h> +#include <linux/task_notifier.h> #include <linux/acct.h> #include <linux/ckrm.h> @@ -1065,6 +1066,8 @@ struct task_struct *copy_process(unsigne * process aggregate containers as the parent process. */ pagg_attach(p, current); + p->task_notifier = NULL; + task_notifier_call_chain(TN_FORK, p); /* * Ok, make it visible to the rest of the system. @@ -1150,6 +1153,7 @@ fork_out: bad_fork_cleanup_namespace: pagg_detach(p); + task_notifier_call_chain(TN_EXIT, NULL); /* should this be a unique TN_xxx id?? */ exit_namespace(p); bad_fork_cleanup_mm: exit_mm(p); Index: linux/kernel/Makefile =================================================================== --- linux.orig/kernel/Makefile 2005-09-17 09:25:27.373125449 -0500 +++ linux/kernel/Makefile 2005-09-17 10:23:09.740024850 -0500 @@ -7,7 +7,7 @@ obj-y = sched.o fork.o exec_domain.o sysctl.o capability.o ptrace.o timer.o user.o \ signal.o sys.o kmod.o workqueue.o pid.o \ rcupdate.o intermodule.o extable.o params.o posix-timers.o \ - kthread.o ckrm/ + kthread.o ckrm/ task_notifier.o obj-$(CONFIG_FUTEX) += futex.o obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o -- Thanks Jack Steiner (st...@sg...) 651-683-5302 Principal Engineer SGI - Silicon Graphics, Inc. |
From: Erik J. <er...@sg...> - 2005-10-07 14:47:39
|
I hope to do an example Linux Job implementation to show limitations on this. However, a deadline will likely consume much of my time the next few days. When I do this, I can better understand what functionality might be lost. The fact that there are no locks implies that only 'current' or in-construction-children can have their notifier list modified. If linux Job were to be modified to support this, it would probably mean reduced functionality. Anything where you say "add this module to that other processes's notifier list" wouldn't be safe. A reduced Job may be OK with people, I'm not sure. Do other people feel this restriction is significant? Erik |
From: Jack S. <st...@sg...> - 2005-10-10 18:47:35
|
On Fri, Oct 07, 2005 at 09:47:25AM -0500, Erik Jacobson wrote: > I hope to do an example Linux Job implementation to show limitations > on this. However, a deadline will likely consume much of my time the > next few days. When I do this, I can better understand what > functionality might be lost. I know that the jattach & jdetach commands need the ability to manipulate the task_notifier/pnotify/pagg callout lists of an arbitrary task, but I'm curious whether this capability is currently used. If used, is it essential. It seems to me that it would be somewhat unusual to want to attach a task to a job when the task is already running. Maybe during startup?? > > The fact that there are no locks implies that only 'current' or > in-construction-children can have their notifier list modified. If linux Job > were to be modified to support this, it would probably mean reduced > functionality. Anything where you say "add this module to that other > processes's notifier list" wouldn't be safe. > > A reduced Job may be OK with people, I'm not sure. Do other people feel > this restriction is significant? > > Erik -- Thanks Jack Steiner (st...@sg...) 651-683-5302 Principal Engineer SGI - Silicon Graphics, Inc. |
From: Erik J. <er...@sg...> - 2005-10-10 19:07:03
|
> It seems to me that it would be somewhat unusual to want to attach a task > to a job when the task is already running. Maybe during startup?? Because Job can group otherwise unrelated processes in to a job so they can then be signaled and tracked as one. Even if we dumped that, I'm still not sure the job detach stuff would work right without locking. I think I've shown that the locks I've tried in my tests don't cost much in terms if additional time spent in fork. I ran a series of tests to show that the pnotify locking didn't cost much in terms of additional fork time. If we went with your task notifier idea, couldn't we run tests to see if straight-forward, easy to understand locks would work well before saying locks are too costly? In a different mail list, a sophisticated method for avoiding locks was proposed. I'm left, then, with a feeling that locks are frowned on for those notifer lists or subscriber lists. I haven't seen evidence that they are costing us much. This leads me to believe I must not be running tests that are satisfactory to people? Are there suggestions for other tests we should run to compare the timings? I can understand if people feel adding locks "bloat" the patch, but I'm not yet convinced that they slow down forks in and of themselves. They do open the door for people to write "slow" kernel components that are hooked in. However, that would be a problem if they had their own callouts too, right? Erik |
From: Jack S. <st...@sg...> - 2005-10-10 19:29:38
|
On Mon, Oct 10, 2005 at 02:06:59PM -0500, Erik Jacobson wrote: > > It seems to me that it would be somewhat unusual to want to attach a task > > to a job when the task is already running. Maybe during startup?? > > Because Job can group otherwise unrelated processes in to a job so they can > then be signaled and tracked as one. Yes, I understand that. But I wonder if it is a feature that is actually used. If you want to track a collection of tasks as a job, it seems like would create an application launcher, put the launcher inside a job container, then launch the application from launcher. Otherwise, it seems like it would be difficult to find all the tasks that belong to the job & do the jattach. I think runon, dplace, cpuset, etc all work this way. > > Even if we dumped that, I'm still not sure the job detach stuff would work > right without locking. > > I think I've shown that the locks I've tried in my tests don't cost much > in terms if additional time spent in fork. > > I ran a series of tests to show that the pnotify locking didn't cost much > in terms of additional fork time. If we went with your task notifier idea, > couldn't we run tests to see if straight-forward, easy to understand locks > would work well before saying locks are too costly? > > In a different mail list, a sophisticated method for avoiding locks was > proposed. I'm left, then, with a feeling that locks are frowned on for > those notifer lists or subscriber lists. I haven't seen evidence that they > are costing us much. This leads me to believe I must not be running tests > that are satisfactory to people? Are there suggestions for other tests > we should run to compare the timings? > > I can understand if people feel adding locks "bloat" the patch, but I'm not > yet convinced that they slow down forks in and of themselves. They do open > the door for people to write "slow" kernel components that are hooked in. > However, that would be a problem if they had their own callouts too, right? > > Erik -- Thanks Jack Steiner (st...@sg...) 651-683-5302 Principal Engineer SGI - Silicon Graphics, Inc. |
From: Erik J. <er...@sg...> - 2005-10-10 20:18:06
|
> Yes, I understand that. But I wonder if it is a feature that is actually used. > If you want to track a collection of tasks as a job, it seems like would The features of job you wish to remove are used by SGI itself by our opensource but not-actively-pushed array sessions kernel module. It's used by SGI's MPI implementation. Some features were added to Job based on feedback in part from customers but I can't prove they're using the features. In any case, we would still want the ability to throw away a job without waiting for processes to die. I'm not sure that's easily possible without locking that list. I can probably work around it some how. It also seems like the type of thing other kernel modules may be interested in doing. It seems like it is reasonable to believe that kernel modules that "do things with tasks" may be interested in more than just "current" or it's child. Job is, why not others? > create an application launcher, put the launcher inside a job container, > then launch the application from launcher. That assumes one specific use of job containers. What if an administrator wanted to contain all users running a specific type of process and group them to be signaled and tracked as one? That doesn't seem unreasonable and it's part of why Job has these features today. For example, the admin might want to put all the weather model simulations from all users in to a job so she can signal them or track them for accounting. I'm still interested in this question: Why are we avoiding locks? semaphore locks are simple to implement, easy to read in the code, and I've shown don't cost much in terms of fork time with pnotify. Am I wrong here? Is it because we're using more memory per task? If we're concerned about fork times and the tests I've run aren't adequate, I'd like the chance to try other tests. Erik |
From: Andrew M. <ak...@os...> - 2005-10-11 02:29:35
|
Erik Jacobson <er...@sg...> wrote: > > I'm still interested in this question: Why are we avoiding locks? General lock aversion, I guess. We already take a ton of locks on the fork/exec/exit path and another one won't kill us. These things are easily measurable and a patch submission should contain the results of that performance testing within its changelog record to ensure that this oft-expressed concern is always addressed. The additional locking should only be present if the feature which uses it is configured in, so the locking is just a cost of having the feature there. Of course, if we can cleanly borrow some other lock in a manner which makes sense (tasklist_lock?) then that's an option. |
From: Jack S. <st...@sg...> - 2005-10-11 03:04:58
|
On Mon, Oct 10, 2005 at 03:17:50PM -0500, Erik Jacobson wrote: > > create an application launcher, put the launcher inside a job container, > > then launch the application from launcher. > > That assumes one specific use of job containers. What if an administrator > wanted to contain all users running a specific type of process and group them > to be signaled and tracked as one? That doesn't seem unreasonable and it's > part of why Job has these features today. For example, the admin might want > to put all the weather model simulations from all users in to a job so she can > signal them or track them for accounting. Keeping all related tasks together seems like a good idea, but it seems easier to do it using a job launcher. Doing it after the fact by explicitly associating an already running pid to a job seems like an iterative process where you have to repeatedly deal with new tasks being created or disappearing. I'm sure it can be done - it just seems messy. > > I'm still interested in this question: Why are we avoiding locks? semaphore > locks are simple to implement, easy to read in the code, and I've > shown don't cost much in terms of fork time with pnotify. Am I wrong here? > Is it because we're using more memory per task? Locks are easy to implement, that isn't the issue. Global locks require bouncing a cache line around between cpus & are costly. In this specific case, however, I believe the lock would be in the task_struct (not global). I suspect the lock will be in a line that is already cache resident because it is shared with other variables that are already referenced during fork/exec/exit. Before adding the lock, I just want to make sure that there is a valid reason to do so. Note also, that I'm not opposed to either PAGG or pnotify. I only suggested task_notifiers as a possible alternative that is extremely lightweight. However, it does not support all the capabilities of PAGG/pnotify. The question is: are these additional capabilities required. > > If we're concerned about fork times and the tests I've run aren't adequate, > I'd like the chance to try other tests. > > Erik -- Thanks Jack Steiner (st...@sg...) 651-683-5302 Principal Engineer SGI - Silicon Graphics, Inc. |
From: Erik J. <er...@sg...> - 2005-10-11 04:04:22
|
> Keeping all related tasks together seems like a good idea, but it seems > easier to do it using a job launcher. Doing it after the fact by explicitly > associating an already running pid to a job seems like an iterative process > where you have to repeatedly deal with new tasks being created or disappearing. > I'm sure it can be done - it just seems messy. I don't think I'm expressing the uses of these commands very well I guess. The point is, these are pieces of the Job suite that has been posted to pa...@os... and LKML many times over the years. I would feel very bad if I had to reduce the Job suite because this suite is deployed on lots of systems and would affect users who have come to use the tools for years now. I don't get an email every time certain library calls are used in the Job suite (thank goodness) so there isn't an easy way for me to prove what is being used - but this is a community project and I have to assume people use the things that were implemented in the kernel patch and job library. For example, SGI is a user of some of the functionality in and of itself. It seems like people are scared by the larger footprint of the pnotify patch I was working on. I'm prepared to let it go in favor of your task notifier patch. But I do feel I need the locking ability. I'd hope that any pnotify users who need the added functionality of pnotify would speak up if this isn't acceptable. I'd of course put the patch through it's paces based on the sorts of tests the community wants to see. For example, I ran AIM and fork-wait bombs on various sized systems for the pnotify discussion earlier. I'm not interested at all in making Linux perform worse and I'm sure my employer wouldn't appreciate that either :) I won't have time to do much with this for a few days though due to some urgent projects/deadlines. Erik |