From: Maneesh S. <sma...@in...> - 2001-04-09 14:41:20
|
Scalable FD Management Using Read-Copy-Update --------------------------------------------- This patch provides a very good performance improvement in file descriptor management for SMP linux kernel on a 4-way machine with the expectation of even higher gains on higher end machines. The patch uses the read-copy-update mechanism for Linux, published earlier at the sourceforge site under Linux Scalablity Effort project. http://lse.sourceforge.net/locking/rclock.html. In SMP kernel the performance is limited due to reader-writer lock taken during various calls using files_struct. Majority being in the routine fget(). Though there is no severe contention for files->file_lock as the files_struct is a per task data structure but enough performance penalties have to be paid while even acquairing the read lock due to the bouncing lock cache line when multiple clones share the same files_struct. This was pointed out by John Hawkes in his posting to lse-tech mailing list http://marc.theaimsgroup.com/?l=lse-tech&m=98235007317770&w=2 The improvement in performance while runnig "chat" benchmark (from http://lbs.sourceforge.net/) is about 30% in average throughput. For both configurations the results are compared for base kernel(2.4.2) and base kernel with files_struct_rcu-2.4.2-0.1.patch. Profiling results were also collected by using SGI's kernprof utility and that shows a considerable decrease in amount of time spent in fget(). The "chat" benchmark was run with rooms=20 and messages=500. For each configuration, the test was run for 50 times and average of throughput result in terms of messages per second was taken. 1. 4-way PIII Xeon 700 MHz with 1MB L2 Cache and 1GB RAM ======================================================== Chat benchmark results ---------------------- Kernel version Average Throughput 2.4.2 191986 2.4.2+file_struct_rcu-2.4.2-0.1.patch 253083 Improvement = 31.8% kernprof results --------------- Kernel Version - 2.4.2 default_idle [C01071EC]: 150696 schedule [C0112EE4]: 105452 __wake_up [C0113518]: 74030 tcp_sendmsg [C020FB14]: 29201 fget [C013436C]: 16318 __generic_copy_to_user [C023A13C]: 15477 USER [C0121DF4]: 12925 tcp_recvmsg [C0210A68]: 7737 system_call [C0109150]: 7399 mcount [C023A4E4]: 5509 Kernel version - 2.4.2+files_struct_rcu-2.4.2+0.1.patch schedule [C0113174]: 101392 __wake_up [C01137D4]: 68182 default_idle [C01071EC]: 32833 tcp_sendmsg [C021ECE4]: 29318 __generic_copy_to_user [C024930C]: 15472 USER [C0122A00]: 12803 tcp_recvmsg [C021FC38]: 8170 system_call [C0109150]: 7636 mcount [C02496B4]: 6150 fget [C0134F58]: 5694 With this patch the routine fget() gets about 65% less hits. 2. 2-way PIII Xeon 700 MHz with 1MB L2 Cache and 1GB RAM ======================================================== Chat benchmark results ---------------------- Kernel version Average Throughput 2.4.2 209592 2.4.2+file_struct_rcu-2.4.2-0.1.patch 222729 Improvement = 6.2% kernprof results ----------------- Kernel Version - 2.4.2 schedule [C0112EE4]: 37583 tcp_sendmsg [C020FB14]: 23381 default_idle [C01071EC]: 20025 __wake_up [C0113518]: 16141 fget [C013436C]: 12733 USER [C0121DF4]: 12474 __generic_copy_to_user [C023A13C]: 12046 tcp_recvmsg [C0210A68]: 7678 system_call [C0109150]: 7113 mcount [C023A4E4]: 4789 Kernel version - 2.4.2+files_struct_rcu-2.4.2+0.1.patch default_idle [C01071EC]: 39086 schedule [C0113174]: 37278 tcp_sendmsg [C021ECE4]: 23148 __wake_up [C01137D4]: 15788 USER [C0122A00]: 12322 __generic_copy_to_user [C024930C]: 12138 tcp_recvmsg [C021FC38]: 7615 system_call [C0109150]: 7481 mcount [C02496B4]: 5310 fget [C0134F58]: 4858 The results with 4-way are much better than 2-way, which shows the need for this patch even more necessary for higher end SMP systems. Results for 8-way will be published soon. With this patch the updates to the files_struct, mainly file expanding the fd_array and expanding bitmaps, are done using callback mechanism. In the routine expand_fd_array() once the new array has been allocated, updated with the old entries and is linked to files_struct, a callback is registered to free the old fd_array, so that if there are any existing users present in the kernel, they will keepon using the old fd_array, and for a new user will see the expanded fd_array. Once quiescent state is reached the callback is fired and the old fd_array is freed. On similar lines the routine expand_fd_set has been modified to use the callback to free old bitmap. Known Race condition -------------------- There could be a race condition involving sys_close() and fget() if both are running for the same files_sturct on different CPUs. Though this problem has not occured so far but this is always possible theoritically and it is being worked out. The patch with solution for this will soon follow. Usage Information ----------------- The patch is built on 2.4.2 kernel. Before applying this patch, the user has to obviously apply the read-copy-update patch (rclock-2.4.2-0.1.patch) for linux, which can be obtained from http://lse.sourceforge.net/locking/rclock.html. The config options required for this patch are CONFIG_RCLOCK and CONFIG_FD_RCU. Regards, Maneesh -- Maneesh Soni <sma...@in...> IBM Linux Technology Center, IBM Software Lab, Bangalore, India ---------------------------------------patch------------------------------------ diff -urN linux-2.4.2+rc/arch/i386/config.in linux-2.4.2+rc+fs/arch/i386/config.in --- linux-2.4.2+rc/arch/i386/config.in Thu Apr 5 16:10:07 2001 +++ linux-2.4.2+rc+fs/arch/i386/config.in Fri Apr 6 12:50:16 2001 @@ -155,6 +155,9 @@ bool 'Math emulation' CONFIG_MATH_EMULATION bool 'MTRR (Memory Type Range Register) support' CONFIG_MTRR bool 'Read-Copy-Update lock support' CONFIG_RCLOCK +if [ "$CONFIG_RCLOCK" = "y" ]; then + bool 'File Descriptor Management using RCU' CONFIG_FD_RCU +fi bool 'Symmetric multi-processing support' CONFIG_SMP if [ "$CONFIG_SMP" != "y" ]; then bool 'APIC and IO-APIC support on uniprocessors' CONFIG_X86_UP_IOAPIC diff -urN linux-2.4.2+rc/drivers/char/tty_io.c linux-2.4.2+rc+fs/drivers/char/tty_io.c --- linux-2.4.2+rc/drivers/char/tty_io.c Wed Feb 14 02:45:04 2001 +++ linux-2.4.2+rc+fs/drivers/char/tty_io.c Thu Apr 5 18:04:45 2001 @@ -1841,7 +1841,9 @@ } task_lock(p); if (p->files) { +#ifndef CONFIG_FD_RCU read_lock(&p->files->file_lock); +#endif /* FIXME: p->files could change */ for (i=0; i < p->files->max_fds; i++) { filp = fcheck_files(p->files, i); @@ -1851,7 +1853,9 @@ break; } } +#ifndef CONFIG_FD_RCU read_unlock(&p->files->file_lock); +#endif } task_unlock(p); } diff -urN linux-2.4.2+rc/fs/exec.c linux-2.4.2+rc+fs/fs/exec.c --- linux-2.4.2+rc/fs/exec.c Sat Feb 10 08:33:39 2001 +++ linux-2.4.2+rc+fs/fs/exec.c Thu Apr 5 17:49:44 2001 @@ -475,7 +475,11 @@ { long j = -1; +#ifdef CONFIG_FD_RCU + spin_lock(&files->file_lock); +#else write_lock(&files->file_lock); +#endif for (;;) { unsigned long set, i; @@ -487,16 +491,28 @@ if (!set) continue; files->close_on_exec->fds_bits[j] = 0; +#ifdef CONFIG_FD_RCU + spin_unlock(&files->file_lock); +#else write_unlock(&files->file_lock); +#endif for ( ; set ; i++,set >>= 1) { if (set & 1) { sys_close(i); } } +#ifdef CONFIG_FD_RCU + spin_lock(&files->file_lock); +#else write_lock(&files->file_lock); +#endif } +#ifdef CONFIG_FD_RCU + spin_unlock(&files->file_lock); +#else write_unlock(&files->file_lock); +#endif } /* diff -urN linux-2.4.2+rc/fs/fcntl.c linux-2.4.2+rc+fs/fs/fcntl.c --- linux-2.4.2+rc/fs/fcntl.c Thu Nov 16 12:20:25 2000 +++ linux-2.4.2+rc+fs/fs/fcntl.c Thu Apr 5 18:00:41 2001 @@ -63,7 +63,11 @@ int error; int start; +#ifdef CONFIG_FD_RCU + spin_lock(&files->file_lock); +#else write_lock(&files->file_lock); +#endif repeat: /* @@ -109,7 +113,11 @@ { FD_SET(fd, files->open_fds); FD_CLR(fd, files->close_on_exec); +#ifdef CONFIG_FD_RCU + spin_unlock(&files->file_lock); +#else write_unlock(&files->file_lock); +#endif fd_install(fd, file); } @@ -125,7 +133,11 @@ return ret; out_putf: +#ifdef CONFIG_FD_RCU + spin_unlock(&files->file_lock); +#else write_unlock(&files->file_lock); +#endif fput(file); return ret; } @@ -136,7 +148,11 @@ struct file * file, *tofree; struct files_struct * files = current->files; +#ifdef CONFIG_FD_RCU + spin_lock(&files->file_lock); +#else write_lock(&files->file_lock); +#endif if (!(file = fcheck(oldfd))) goto out_unlock; err = newfd; @@ -167,7 +183,11 @@ files->fd[newfd] = file; FD_SET(newfd, files->open_fds); FD_CLR(newfd, files->close_on_exec); +#ifdef CONFIG_FD_RCU + spin_unlock(&files->file_lock); +#else write_unlock(&files->file_lock); +#endif if (tofree) filp_close(tofree, files); @@ -175,11 +195,19 @@ out: return err; out_unlock: +#ifdef CONFIG_FD_RCU + spin_unlock(&files->file_lock); +#else write_unlock(&files->file_lock); +#endif goto out; out_fput: +#ifdef CONFIG_FD_RCU + spin_unlock(&files->file_lock); +#else write_unlock(&files->file_lock); +#endif fput(file); goto out; } diff -urN linux-2.4.2+rc/fs/file.c linux-2.4.2+rc+fs/fs/file.c --- linux-2.4.2+rc/fs/file.c Sat Feb 10 00:59:44 2001 +++ linux-2.4.2+rc+fs/fs/file.c Thu Apr 5 17:39:29 2001 @@ -14,6 +14,17 @@ #include <asm/bitops.h> +#ifdef CONFIG_FD_RCU +#include <linux/rclock.h> + +struct fdset_callback_func_args { + int nfds; + fd_set *new_openset; + fd_set *new_execset; +}; +typedef struct fdset_callback_func_args fdset_callback_func_args_t; + +#endif /* * Allocate an fd array, using kmalloc or vmalloc. @@ -48,6 +59,109 @@ vfree(array); } + +#ifdef CONFIG_FD_RCU + +int fd_array_callback_func(rc_callback_t *cb, void *arg1, void *arg2) +{ + struct file **old_fds = (struct file **)arg1; + unsigned long nfds = (unsigned long ) arg2; + + free_fd_array(old_fds, nfds); + rc_free_callback(cb); + + return 0; +} + +int expand_fd_array(struct files_struct *files, int nr) +{ + struct file **new_fds; + int error, nfds; + rc_callback_t *fd_cb; + + error = -EMFILE; + if (files->max_fds >= NR_OPEN || nr >= NR_OPEN) + goto out; + + nfds = files->max_fds; + spin_unlock(&files->file_lock); + + /* + * Expand to the max in easy steps, and keep expanding it until + * we have enough for the requested fd array size. + */ + + do { +#if NR_OPEN_DEFAULT < 256 + if (nfds < 256) + nfds = 256; + else +#endif + if (nfds < (PAGE_SIZE / sizeof(struct file *))) + nfds = PAGE_SIZE / sizeof(struct file *); + else { + nfds = nfds * 2; + if (nfds > NR_OPEN) + nfds = NR_OPEN; + } + } while (nfds <= nr); + + error = -ENOMEM; + + fd_cb = rc_alloc_callback(fd_array_callback_func, NULL, NULL, + GFP_ATOMIC); + new_fds = alloc_fd_array(nfds); + spin_lock(&files->file_lock); + if (!new_fds) { + rc_free_callback(fd_cb); + goto out; + } + + /* Copy the existing array and install the new pointer */ + + if (nfds > files->max_fds) { + struct file **old_fds = files->fd; + int i = files->max_fds; + + + /* Don't copy/clear the array if we are creating a new + fd array for fork() */ + if (i) { + memcpy(new_fds, old_fds, i * sizeof(struct file *)); + /* clear the remainder of the array */ + memset(&new_fds[i], 0, + (nfds-i) * sizeof(struct file *)); + } + + RC_MEMSYNC(); + + old_fds = xchg(&files->fd, new_fds); + + RC_MEMSYNC(); + + i = xchg(&files->max_fds, nfds); + + if (i) { + fd_cb->arg1 = old_fds; + fd_cb->arg2 = (void *) i; + rc_callback(fd_cb); + } else { + rc_free_callback(fd_cb); + } + + } else { + /* Somebody expanded the array while we slept ... */ + spin_unlock(&files->file_lock); + free_fd_array(new_fds, nfds); + rc_free_callback(fd_cb); + spin_lock(&files->file_lock); + } + error = 0; +out: + return error; +} + +#else /* CONFIG_FD_RCU */ /* * Expand the fd array in the files_struct. Called with the files * spinlock held for write. @@ -123,6 +237,7 @@ out: return error; } +#endif /* CONFIG_FD_RCU */ /* * Allocate an fdset array, using kmalloc or vmalloc. @@ -157,6 +272,117 @@ vfree(array); } +#ifdef CONFIG_FD_RCU + +int fdset_callback_func(rc_callback_t *cb, void *arg1, void *arg2) +{ + int nfds = ((fdset_callback_func_args_t *)arg1)->nfds; + fd_set *new_openset = ((fdset_callback_func_args_t *)arg1)->new_openset; + fd_set *new_execset = ((fdset_callback_func_args_t *)arg1)->new_execset; + + free_fdset (new_openset, nfds); + free_fdset (new_execset, nfds); + + kfree(arg1); + rc_free_callback(cb); + + return 0; +} + +int expand_fdset(struct files_struct *files, int nr) +{ + fd_set *new_openset = 0, *new_execset = 0; + int error, nfds = 0; + rc_callback_t *fd_cb = NULL; + fdset_callback_func_args_t *args = NULL; + + error = -EMFILE; + if (files->max_fdset >= NR_OPEN || nr >= NR_OPEN) + goto out; + + nfds = files->max_fdset; + spin_unlock(&files->file_lock); + + /* Expand to the max in easy steps */ + do { + if (nfds < (PAGE_SIZE * 8)) + nfds = PAGE_SIZE * 8; + else { + nfds = nfds * 2; + if (nfds > NR_OPEN) + nfds = NR_OPEN; + } + } while (nfds <= nr); + + error = -ENOMEM; + new_openset = alloc_fdset(nfds); + new_execset = alloc_fdset(nfds); + + fd_cb = rc_alloc_callback(fdset_callback_func, NULL, NULL, GFP_ATOMIC); + args = (fdset_callback_func_args_t *)kmalloc(sizeof(args), GFP_ATOMIC); + + spin_lock(&files->file_lock); + if (!new_openset || !new_execset) + goto out; + + error = 0; + + /* Copy the existing tables and install the new pointers */ + if (nfds > files->max_fdset) { + int i = files->max_fdset / (sizeof(unsigned long) * 8); + int count = (nfds - files->max_fdset) / 8; + + /* + * Don't copy the entire array if the current fdset is + * not yet initialised. + */ + if (i) { + memcpy (new_openset, files->open_fds, files->max_fdset/8); + memcpy (new_execset, files->close_on_exec, files->max_fdset/8); + memset (&new_openset->fds_bits[i], 0, count); + memset (&new_execset->fds_bits[i], 0, count); + } + + + RC_MEMSYNC(); + + nfds = xchg(&files->max_fdset, nfds); + + RC_MEMSYNC(); + + new_openset = xchg(&files->open_fds, new_openset); + + RC_MEMSYNC(); + + new_execset = xchg(&files->close_on_exec, new_execset); + + args->nfds = nfds; + args->new_openset = new_openset; + args->new_execset = new_execset; + + fd_cb->arg1 = args; + + rc_callback(fd_cb); + + return 0; + } + /* Somebody expanded the array while we slept ... */ + +out: + spin_unlock(&files->file_lock); + if (new_openset) + free_fdset(new_openset, nfds); + if (new_execset) + free_fdset(new_execset, nfds); + if (fd_cb) + rc_free_callback(fd_cb); + if (args) + kfree(args); + spin_lock(&files->file_lock); + return error; +} + +#else /* * Expand the fdset in the files_struct. Called with the files spinlock * held for write. @@ -230,3 +456,4 @@ return error; } +#endif /* CONFIG_FD_RCU */ diff -urN linux-2.4.2+rc/fs/file_table.c linux-2.4.2+rc+fs/fs/file_table.c --- linux-2.4.2+rc/fs/file_table.c Tue Dec 5 23:57:31 2000 +++ linux-2.4.2+rc+fs/fs/file_table.c Fri Apr 6 19:25:43 2001 @@ -127,11 +127,15 @@ struct file * file; struct files_struct *files = current->files; +#ifndef CONFIG_FD_RCU read_lock(&files->file_lock); +#endif file = fcheck(fd); if (file) get_file(file); +#ifndef CONFIG_FD_RCU read_unlock(&files->file_lock); +#endif return file; } diff -urN linux-2.4.2+rc/fs/open.c linux-2.4.2+rc+fs/fs/open.c --- linux-2.4.2+rc/fs/open.c Sat Feb 10 00:59:44 2001 +++ linux-2.4.2+rc+fs/fs/open.c Thu Apr 5 17:45:22 2001 @@ -16,6 +16,9 @@ #include <linux/tty.h> #include <asm/uaccess.h> +#ifdef CONFIG_FD_RCU +#include <linux/rclock.h> +#endif #define special_file(m) (S_ISCHR(m)||S_ISBLK(m)||S_ISFIFO(m)||S_ISSOCK(m)) @@ -688,7 +691,11 @@ int fd, error; error = -EMFILE; +#ifdef CONFIG_FD_RCU + spin_lock(&files->file_lock); +#else write_lock(&files->file_lock); +#endif repeat: fd = find_next_zero_bit(files->open_fds, @@ -737,7 +744,11 @@ error = fd; out: +#ifdef CONFIG_FD_RCU + spin_unlock(&files->file_lock); +#else write_unlock(&files->file_lock); +#endif return error; } @@ -818,7 +829,11 @@ struct file * filp; struct files_struct *files = current->files; +#ifdef CONFIG_FD_RCU + spin_lock(&files->file_lock); +#else write_lock(&files->file_lock); +#endif if (fd >= files->max_fds) goto out_unlock; filp = files->fd[fd]; @@ -827,11 +842,19 @@ files->fd[fd] = NULL; FD_CLR(fd, files->close_on_exec); __put_unused_fd(files, fd); +#ifdef CONFIG_FD_RCU + spin_unlock(&files->file_lock); +#else write_unlock(&files->file_lock); +#endif return filp_close(filp, files); out_unlock: +#ifdef CONFIG_FD_RCU + spin_unlock(&files->file_lock); +#else write_unlock(&files->file_lock); +#endif return -EBADF; } diff -urN linux-2.4.2+rc/fs/proc/base.c linux-2.4.2+rc+fs/fs/proc/base.c --- linux-2.4.2+rc/fs/proc/base.c Fri Nov 17 02:48:26 2000 +++ linux-2.4.2+rc+fs/fs/proc/base.c Thu Apr 5 18:03:44 2001 @@ -739,12 +739,16 @@ task_unlock(task); if (!files) goto out_unlock; +#ifndef CONFIG_FD_RCU read_lock(&files->file_lock); +#endif file = inode->u.proc_i.file = fcheck_files(files, fd); if (!file) goto out_unlock2; get_file(file); +#ifndef CONFIG_FD_RCU read_unlock(&files->file_lock); +#endif put_files_struct(files); inode->i_op = &proc_pid_link_inode_operations; inode->i_size = 64; @@ -760,7 +764,9 @@ out_unlock2: put_files_struct(files); +#ifndef CONFIG_FD_RCU read_unlock(&files->file_lock); +#endif out_unlock: iput(inode); out: diff -urN linux-2.4.2+rc/fs/select.c linux-2.4.2+rc+fs/fs/select.c --- linux-2.4.2+rc/fs/select.c Sat Feb 10 00:59:44 2001 +++ linux-2.4.2+rc+fs/fs/select.c Thu Apr 5 17:50:20 2001 @@ -166,9 +166,13 @@ int retval, i, off; long __timeout = *timeout; +#ifndef CONFIG_FD_RCU read_lock(¤t->files->file_lock); +#endif retval = max_select_fd(n, fds); +#ifndef CONFIG_FD_RCU read_unlock(¤t->files->file_lock); +#endif if (retval < 0) return retval; diff -urN linux-2.4.2+rc/include/linux/file.h linux-2.4.2+rc+fs/include/linux/file.h --- linux-2.4.2+rc/include/linux/file.h Wed Aug 23 23:52:26 2000 +++ linux-2.4.2+rc+fs/include/linux/file.h Thu Apr 5 16:34:02 2001 @@ -12,21 +12,33 @@ { struct files_struct *files = current->files; int res; +#ifndef CONFIG_FD_RCU read_lock(&files->file_lock); +#endif res = FD_ISSET(fd, files->close_on_exec); +#ifndef CONFIG_FD_RCU read_unlock(&files->file_lock); +#endif return res; } static inline void set_close_on_exec(unsigned int fd, int flag) { struct files_struct *files = current->files; +#ifdef CONFIG_FD_RCU + spin_lock(&files->file_lock); +#else write_lock(&files->file_lock); +#endif if (flag) FD_SET(fd, files->close_on_exec); else FD_CLR(fd, files->close_on_exec); +#ifdef CONFIG_FD_RCU + spin_unlock(&files->file_lock); +#else write_unlock(&files->file_lock); +#endif } static inline struct file * fcheck_files(struct files_struct *files, unsigned int fd) @@ -66,9 +78,17 @@ { struct files_struct *files = current->files; +#ifdef CONFIG_FD_RCU + spin_lock(&files->file_lock); +#else write_lock(&files->file_lock); +#endif __put_unused_fd(files, fd); +#ifdef CONFIG_FD_RCU + spin_unlock(&files->file_lock); +#else write_unlock(&files->file_lock); +#endif } /* @@ -87,12 +107,21 @@ static inline void fd_install(unsigned int fd, struct file * file) { struct files_struct *files = current->files; - + +#ifdef CONFIG_FD_RCU + spin_lock(&files->file_lock); +#else write_lock(&files->file_lock); +#endif if (files->fd[fd]) BUG(); files->fd[fd] = file; + +#ifdef CONFIG_FD_RCU + spin_unlock(&files->file_lock); +#else write_unlock(&files->file_lock); +#endif } void put_files_struct(struct files_struct *fs); diff -urN linux-2.4.2+rc/include/linux/sched.h linux-2.4.2+rc+fs/include/linux/sched.h --- linux-2.4.2+rc/include/linux/sched.h Thu Feb 22 05:39:58 2001 +++ linux-2.4.2+rc+fs/include/linux/sched.h Fri Apr 6 14:09:18 2001 @@ -167,7 +167,11 @@ */ struct files_struct { atomic_t count; +#ifdef CONFIG_FD_RCU + spinlock_t file_lock; +#else rwlock_t file_lock; +#endif int max_fds; int max_fdset; int next_fd; @@ -179,6 +183,22 @@ struct file * fd_array[NR_OPEN_DEFAULT]; }; +#ifdef CONFIG_FD_RCU +#define INIT_FILES \ +{ \ + count: ATOMIC_INIT(1), \ + file_lock: SPIN_LOCK_UNLOCKED, \ + max_fds: NR_OPEN_DEFAULT, \ + max_fdset: __FD_SETSIZE, \ + next_fd: 0, \ + fd: &init_files.fd_array[0], \ + close_on_exec: &init_files.close_on_exec_init, \ + open_fds: &init_files.open_fds_init, \ + close_on_exec_init: { { 0, } }, \ + open_fds_init: { { 0, } }, \ + fd_array: { NULL, } \ +} +#else #define INIT_FILES \ { \ count: ATOMIC_INIT(1), \ @@ -193,6 +213,7 @@ open_fds_init: { { 0, } }, \ fd_array: { NULL, } \ } +#endif /* Maximum number of active map areas.. This is a random (large) number */ #define MAX_MAP_COUNT (65536) diff -urN linux-2.4.2+rc/kernel/fork.c linux-2.4.2+rc+fs/kernel/fork.c --- linux-2.4.2+rc/kernel/fork.c Sat Feb 10 00:59:44 2001 +++ linux-2.4.2+rc+fs/kernel/fork.c Thu Apr 5 16:22:08 2001 @@ -24,6 +24,9 @@ #include <asm/uaccess.h> #include <asm/mmu_context.h> +#ifdef CONFIG_FD_RCU +#include <linux/rclock.h> +#endif /* The idle threads do not count.. */ int nr_threads; int nr_running; @@ -433,7 +436,11 @@ atomic_set(&newf->count, 1); +#ifdef CONFIG_FD_RCU + newf->file_lock = SPIN_LOCK_UNLOCKED; +#else newf->file_lock = RW_LOCK_UNLOCKED; +#endif newf->next_fd = 0; newf->max_fds = NR_OPEN_DEFAULT; newf->max_fdset = __FD_SETSIZE; @@ -446,13 +453,23 @@ size = oldf->max_fdset; if (size > __FD_SETSIZE) { newf->max_fdset = 0; +#ifdef CONFIG_FD_RCU + spin_lock(&newf->file_lock); +#else write_lock(&newf->file_lock); +#endif error = expand_fdset(newf, size-1); +#ifdef CONFIG_FD_RCU + spin_unlock(&newf->file_lock); +#else write_unlock(&newf->file_lock); +#endif if (error) goto out_release; } +#ifndef CONFIG_FD_RCU read_lock(&oldf->file_lock); +#endif open_files = count_open_files(oldf, size); @@ -463,15 +480,27 @@ */ nfds = NR_OPEN_DEFAULT; if (open_files > nfds) { +#ifndef CONFIG_FD_RCU read_unlock(&oldf->file_lock); +#endif newf->max_fds = 0; +#ifdef CONFIG_FD_RCU + spin_lock(&newf->file_lock); +#else write_lock(&newf->file_lock); +#endif error = expand_fd_array(newf, open_files-1); +#ifdef CONFIG_FD_RCU + spin_unlock(&newf->file_lock); +#else write_unlock(&newf->file_lock); +#endif if (error) goto out_release; nfds = newf->max_fds; +#ifndef CONFIG_FD_RCU read_lock(&oldf->file_lock); +#endif } old_fds = oldf->fd; @@ -486,7 +515,9 @@ get_file(f); *new_fds++ = f; } +#ifndef CONFIG_FD_RCU read_unlock(&oldf->file_lock); +#endif /* compute the remainder to be cleared */ size = (newf->max_fds - open_files) * sizeof(struct file *); diff -urN linux-2.4.2+rc/net/ipv4/netfilter/ipt_owner.c linux-2.4.2+rc+fs/net/ipv4/netfilter/ipt_owner.c --- linux-2.4.2+rc/net/ipv4/netfilter/ipt_owner.c Mon Aug 14 01:23:19 2000 +++ linux-2.4.2+rc+fs/net/ipv4/netfilter/ipt_owner.c Thu Apr 5 18:06:21 2001 @@ -25,16 +25,22 @@ task_lock(p); files = p->files; if(files) { +#ifndef CONFIG_FD_RCU read_lock(&files->file_lock); +#endif for (i=0; i < files->max_fds; i++) { if (fcheck_files(files, i) == skb->sk->socket->file) { +#ifndef CONFIG_FD_RCU read_unlock(&files->file_lock); +#endif task_unlock(p); read_unlock(&tasklist_lock); return 1; } } +#ifndef CONFIG_FD_RCU read_unlock(&files->file_lock); +#endif } task_unlock(p); out: @@ -58,14 +64,18 @@ task_lock(p); files = p->files; if (files) { +#ifndef CONFIG_FD_RCU read_lock(&files->file_lock); +#endif for (i=0; i < files->max_fds; i++) { if (fcheck_files(files, i) == file) { found = 1; break; } } +#ifndef CONFIG_FD_RCU read_unlock(&files->file_lock); +#endif } task_unlock(p); if(found) |
From: Anton B. <an...@sa...> - 2001-04-12 01:32:01
|
> This patch provides a very good performance improvement in file > descriptor management for SMP linux kernel on a 4-way machine with the > expectation of even higher gains on higher end machines. The patch uses the > read-copy-update mechanism for Linux, published earlier at the sourceforge > site under Linux Scalablity Effort project. > http://lse.sourceforge.net/locking/rclock.html. Good stuff! It would be interesting to try a filesystem benchmark such as dbench. On a quad PPC fget was chewing up more than its fair share of cpu time. Anton |
From: Maneesh S. <sma...@in...> - 2001-04-12 15:42:35
|
On Wed, Apr 11, 2001 at 06:29:30PM -0700, Anton Blanchard wrote: > > > This patch provides a very good performance improvement in file > > descriptor management for SMP linux kernel on a 4-way machine with the > > expectation of even higher gains on higher end machines. The patch uses the > > read-copy-update mechanism for Linux, published earlier at the sourceforge > > site under Linux Scalablity Effort project. > > http://lse.sourceforge.net/locking/rclock.html. > > Good stuff! > > It would be interesting to try a filesystem benchmark such as dbench. On > a quad PPC fget was chewing up more than its fair share of cpu time. > > Anton Hello Anton, Thank you for your suggestion. I tried dbench for on a 4-way PIII Xeon box, with 1MB L2 Cache and 1GB of RAM. I ran it on 8 GB ext2 partition with Adaptec 7896 SCSI controller. I ran "dbench 100" and "dbench 200" for five times and took the average of the throughput and found that for Base (2.4.2) - 100 Average Throughput = 39.628 MB/sec 200 Average Throughput = 22.792 MB/sec Base + files_struct patch - 100 Average Throughput = 39.874 MB/sec 200 Average Throughput = 23.174 MB/sec I found this value quite less than the one present in the README distributed with dbench tarball. I think the numbers in the README were for a similar machine but with 2.2.9 kernel. As you can see the performance with files_struct patch is almost same as base. I feel I am hitting some bottleneck other than fget() in both base and the patched versions. I think atleast for base version I should get similar numbers as mentions in the README for similar configuration. Though I intend to do some profiling to look into this but it will be helpfull if you can tell me if there is some known thing regarding this. I am copying this to Andrew also, if he can also help. Also if you have some dbench numbers from 2.4.x kernel, please let me have a look into those also. Thank you, Maneesh -- Maneesh Soni <sma...@in...> IBM Linux Technology Center, IBM Software Lab, Bangalore, India |
From: Anton B. <an...@sa...> - 2001-04-12 15:53:54
|
Hi, > Base (2.4.2) - > 100 Average Throughput = 39.628 MB/sec > 200 Average Throughput = 22.792 MB/sec > > Base + files_struct patch - > 100 Average Throughput = 39.874 MB/sec > 200 Average Throughput = 23.174 MB/sec > > I found this value quite less than the one present in the README distributed > with dbench tarball. I think the numbers in the README were for a similar > machine but with 2.2.9 kernel. If you guestimate that each dbench client uses about 20M of RAM then dbench 100 has no chance of remaining in memory. Once you hit disk then spinlock optimisations are all in the noise :) Smaller runs (< 30) should see it stay in memory. Also if you turn of kupdated (so old buffers are not flushed out just because they are old) and make the VM more agressive about filling memory with dirty buffers then you will not hit the disk and then hopefully the optimisations will be more obvious. killall -STOP kupdated echo "90 64 64 256 500 3000 95 0 0" > /proc/sys/vm/bdflush Remember to killall -CONT kupdated when you are finished :) > I am copying this to Andrew also, if he can also help. Also if you have some > dbench numbers from 2.4.x kernel, please let me have a look into those also. The single CPU 333MHz POWER 3 I was playing with got 100MB/s when not touching disk. Anton |
From: Maneesh S. <sma...@in...> - 2001-04-17 10:47:15
|
On Thu, Apr 12, 2001 at 08:51:18AM -0700, Anton Blanchard wrote: > > Hi, > > > Base (2.4.2) - > > 100 Average Throughput = 39.628 MB/sec > > 200 Average Throughput = 22.792 MB/sec > > > > Base + files_struct patch - > > 100 Average Throughput = 39.874 MB/sec > > 200 Average Throughput = 23.174 MB/sec > > > > I found this value quite less than the one present in the README distributed > > with dbench tarball. I think the numbers in the README were for a similar > > machine but with 2.2.9 kernel. > > If you guestimate that each dbench client uses about 20M of RAM then dbench > 100 has no chance of remaining in memory. Once you hit disk then spinlock > optimisations are all in the noise :) Smaller runs (< 30) should see > it stay in memory. > > Also if you turn of kupdated (so old buffers are not flushed out just > because they are old) and make the VM more agressive about filling > memory with dirty buffers then you will not hit the disk and then > hopefully the optimisations will be more obvious. > > killall -STOP kupdated > echo "90 64 64 256 500 3000 95 0 0" > /proc/sys/vm/bdflush > > Remember to killall -CONT kupdated when you are finished :) > > > I am copying this to Andrew also, if he can also help. Also if you have some > > dbench numbers from 2.4.x kernel, please let me have a look into those also. > > The single CPU 333MHz POWER 3 I was playing with got 100MB/s when not > touching disk. > > Anton > - Hi Anton, I ran the dbench test as per your suggesstions. Now I get similar throughput numbers as yours. But still the throughput improvement is not there for my patch. the reason, I think, is that I didnot get too many hits to fget() routine. It will be helpful if you can tell how you got fget() chewing up more than its fair share of CPU time. For 30 clients: Base(2.4.2) - Average Throughput = 235.139 MB/S Base + Files_struct patch - Average Throughput = 235.751 MB/S I also did profiling while running these tests, using kernprof. The fget hits are as below Base(2.4.2) 304 Base + Files_struct patch 189 Though while doing kernprofile'ing my sample size is quite big (the top ranker in profiling is "default_idle" with 28471 hits). The fget's hit count is quite low compared to "default_idle". As you can also see, the files_struct patch is able to reduce the number of hits to fget by around 37% I also saw the dbench.c code. It does creates no. of child processes but with fork() and not through __clone(). I think the fget() will affect the performance in the scenarios where the childs are created using clone() with CLONE_FILES flag set. That is when many child processes share parent's files_struct and everybody tries to acquire the same files->file_lock. And in those scenarios we should see considerable performance improvement by using the files_struct patch as in the case of "chat"benchmark. Regards, Maneesh -- Maneesh Soni <sma...@in...> http://lse.sourceforge.net/locking/rclock.html IBM Linux Technology Center, IBM Software Lab, Bangalore, India |
From: Andi K. <ak...@su...> - 2001-04-17 14:36:27
|
On Tue, Apr 17, 2001 at 04:19:16PM +0530, Maneesh Soni wrote: > But still the throughput improvement is not there for my patch. the reason, I > think, is that I didnot get too many hits to fget() routine. It will be helpful > if you can tell how you got fget() chewing up more than its fair share of CPU > time. dbench uses fork(), not clone, and therefore has no shared file_structs that could be contended. -Andi |
From: John H. <ha...@en...> - 2001-04-12 18:22:37
|
From: "Maneesh Soni" <sma...@in...> ... > Though I intend to do some > profiling to look into this but it will be helpfull if you can tell me if there > is some known thing regarding this. ... Another useful tool is Lockmeter, which gives you visibility to what's happening with spinlock_t and rwlock_t usage. http://oss.sgi.com/projects/lockmeter John Hawkes ha...@en... |