From: Jun N. <ju...@sc...> - 2001-01-23 16:54:35
I tried to run SDET (Software Development Environment Throughput), which is
basically a system-level, throughput-oriented benchmark, on the 2.4.0 kernel
and on the 2.4.0 kernel with this patch. I guess many (old?) Unix guys are
familiar with it, and it is (was?) sometimes used to check some aspects of a
system's scalability. The details of this benchmark are not so important in
this mail (available upon request).

The following are very preliminary numbers from the benchmark. Tests were run
on a system with eight 550 MHz Pentium III processors. I think these results
are encouraging.

# of Scripts    Throughput      Throughput
                2.4             2.4-multi-queue
------------    ----------      ---------------
      1           2057.1           1978.0
      2           4114.3           4067.8
      4           7700.5           7700.5
      6          10746.3          10746.3
      8          12973.0          12576.4
     10          13186.8          13235.3
     15          13138.7          13235.3
     20          12996.4          13043.5
     25          13005.8          13005.8
     30          12811.4          13059.3
     40          12676.1          12732.1
     50          12121.2          12676.1
     60          12314.7          12442.4
     70          12051.6          11954.5
     80          11871.4          11985.0
     90          11608.7          11777.5
    100          10849.9          11523.7
    125          10678.7          10940.9
    150          10416.7          10503.8
    175          10187.6          10314.3
    200           9749.5          10106.7
    250           8343.4           8787.3

I also checked hot spots on the 2.4.0 kernel (not with multi-queue) with
lockmeter (http://oss.sgi.com/projects/lockmeter/). The data were sampled
when the number of scripts is 175.

SPINLOCKS         HOLD              WAIT
  UTIL    CON      MEAN(   MAX )     MEAN(   MAX )      TOTAL NAME
...
 10.56%  26.89%    7.4us(  175us)    3.4us(  692us)   1569304 runqueue_lock
  2.23%  29.75%    4.5us(   20us)    4.4us(  646us)    550505  __wake_up+0x7c
  0.01%  11.62%    6.6us(   15us)    1.0us(   65us)      2056  __wake_up+0x128
  0.00%  14.29%    0.4us(  2.6us)    3.0us(  332us)      1393  deliver_signal+0x58
  0.00%   9.94%    7.2us(   16us)    1.2us(   56us)       332  process_timeout+0x14
  0.01%  26.70%    4.7us(   16us)    5.0us(  296us)      1457  schedule_tail+0x58
  7.53%  23.28%     11us(  175us)    3.0us(  692us)    781676  schedule+0xd0
  0.66%  35.42%    3.5us(   23us)    2.8us(  486us)    206008  schedule+0x458
  0.00%  11.79%    4.2us(   78us)    1.1us(   56us)       560  schedule+0x504
  0.11%   9.42%    5.0us(   21us)    2.3us(  420us)     25317  wake_up_process+0x14

The above result basically says that the utilization of runqueue_lock is
about 10% during the benchmark, and that nearly 27% of the requests for this
lock have to spin and wait for it (the NAMEs below the lock are the locations
where that lock is used). This might explain the throughput improvement
gained by the multi-queue scheduler.

Now, who has the largest utilization? Of course, it's kernel_flag.

SPINLOCKS         HOLD              WAIT
  UTIL    CON      MEAN(   MAX )     MEAN(   MAX )      TOTAL NAME
...
 43.15%  33.08%     13us(95971us)     12us(95997us)   3558789 kernel_flag
  0.02%  38.26%    0.7us(   29us)     34us(94975us)     23788  acct_process+0x1c
  0.02%  44.63%    8.3us(   43us)     23us(  675us)      2012  chrdev_open+0x4c
  0.00%  22.26%    0.9us(  2.5us)     16us(  525us)       283  de_put+0x28
  5.26%  38.34%    244us( 1184us)     21us(53127us)     23788  do_exit+0xf8
  0.99%  36.22%     11us(  840us)     12us(53195us)     96205  ext2_delete_inode+0x20
  0.46%  29.64%    1.2us(  159us)    9.1us(53249us)    430421  ext2_discard_prealloc+0x20
  1.28%  40.60%    9.7us(  152us)     22us(43404us)    146014  ext2_get_block+0x54
  0.00%  40.00%    0.4us(  0.7us)    8.6us(   34us)         5  locks_remove_flock+0x34
  0.00%  40.00%    0.6us(  1.2us)    4.5us(   14us)         5  locks_remove_posix+0x38
  0.92%  40.80%     12us(  572us)     16us(47804us)     84618  lookup_hash+0x84
  0.16%  37.35%    1.0us(  178us)     13us(53173us)    175002  notify_change+0x68
  7.78%  15.00%     46us( 2523us)    3.1us(27213us)    188485  permission+0x38
 20.34%  32.99%     12us( 1981us)     12us(95997us)   1927065  real_lookup+0x64
  0.05%  47.31%    595us(51910us)     22us(  270us)        93  schedule+0x490
  0.56%  42.11%  32861us(95971us)     41us(  405us)        19  sync_old_buffers+0x20
  0.83%  40.22%     19us( 1473us)     19us(41614us)     48081  sys_fcntl64+0x44
  0.01%  38.05%    1.3us(   37us)     22us(49506us)     12422  sys_ioctl+0x4c
  0.06%  33.12%    0.5us(   62us)     15us(49778us)    132230  sys_llseek+0x88
  0.00%  39.64%    0.9us(  4.9us)     19us(  849us)      5401  sys_lseek+0x6c
  0.00%  37.50%     28us(   48us)     12us(  222us)       200  sys_rename+0x1a0
  0.02%  42.29%    6.2us(   22us)     81us(93181us)      3802  sys_sysctl+0x4c
  0.00%  52.27%    6.4us(   29us)     13us(  156us)       132  tty_read+0xbc
  0.01%  41.36%     13us(   37us)     16us(  434us)       810  tty_release+0x1c
  0.00%  48.12%     17us(  143us)     22us(  497us)       133  tty_write+0x1bc
  2.08%  41.32%     25us(  309us)     18us(29470us)     92009  vfs_create+0x98
  0.52%  38.57%     85us(  227us)     12us(  698us)      6800  vfs_mkdir+0x90
  1.10%  38.40%     20us(  317us)     14us( 1100us)     60359  vfs_readdir+0x68
  0.07%  41.66%     12us(   78us)     18us( 1120us)      6800  vfs_rmdir+0x188
  0.00% 100.00%     24us(   24us)     21us(   27us)         2  vfs_statfs+0x4c
  0.60%  36.52%    7.2us(  104us)    9.4us(  904us)     91805  vfs_unlink+0x110

This tells us many things; in particular:

- The utilization of kernel_flag is about 43%, and more than half of that
  utilization comes from real_lookup.
- Its average hold-time is not especially significant, but its max wait-time
  is.
- The location sync_old_buffers+0x20 looks responsible for the longest
  wait-time (95997us).
- sync_old_buffers accounts for only 0.83% of the lock utilization, but it
  has the largest average (32861us) and max (95971us) hold-time.

So if we replaced the big kernel lock with a fine-grained lock in the
real_lookup function, we would see further throughput improvement, at least
for this benchmark. I guess the reason for holding the big kernel lock in
real_lookup() is that not all filesystems implement an MP-safe lookup
routine. Is that a correct assumption? For sync_old_buffers, we could hold
the big kernel lock per filesystem, for example.

For reference, here are the two routines as they are in 2.4.0:

static struct dentry * real_lookup(struct dentry * parent,
                                   struct qstr * name, int flags)
{
        ...
        result = d_lookup(parent, name);
        if (!result) {
                struct dentry * dentry = d_alloc(parent, name);
                result = ERR_PTR(-ENOMEM);
                if (dentry) {
                        lock_kernel();
                        result = dir->i_op->lookup(dir, dentry);
                        unlock_kernel();
                        if (result)
                                dput(dentry);
                        else
                                result = dentry;
                }
                up(&dir->i_sem);
                return result;
        }
        ...
}

static int sync_old_buffers(void)
{
        lock_kernel();
        sync_supers(0);
        sync_inodes(0);
        unlock_kernel();

        flush_dirty_buffers(1);
        /* must really sync all the active I/O request to disk here */
        run_task_queue(&tq_disk);
        return 0;
}
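If that assumption holds, one way to approach real_lookup() would be to make
the lock_kernel() conditional, so that a filesystem whose lookup routine is
already MP-safe can skip it while everyone else keeps the current behaviour.
The sketch below only illustrates the idea and is not a tested patch: the
FS_MPSAFE_LOOKUP bit in fs_flags is hypothetical and does not exist in 2.4.0.

/* Hypothetical fs_flags bit; a filesystem would set it once its
 * ->lookup() no longer relies on the big kernel lock for protection. */
#define FS_MPSAFE_LOOKUP        0x8000

static struct dentry * real_lookup(struct dentry * parent,
                                   struct qstr * name, int flags)
{
        struct inode * dir = parent->d_inode;
        ...
        result = d_lookup(parent, name);
        if (!result) {
                struct dentry * dentry = d_alloc(parent, name);
                result = ERR_PTR(-ENOMEM);
                if (dentry) {
                        /* take the BKL only for filesystems that still need it */
                        int need_bkl = !(dir->i_sb->s_type->fs_flags &
                                         FS_MPSAFE_LOOKUP);

                        if (need_bkl)
                                lock_kernel();
                        result = dir->i_op->lookup(dir, dentry);
                        if (need_bkl)
                                unlock_kernel();
                        if (result)
                                dput(dentry);
                        else
                                result = dentry;
                }
                up(&dir->i_sem);
                return result;
        }
        ...
}

Note that dir->i_sem already serializes lookups within a single directory
here, so what the BKL is really protecting is filesystem-global state; that
is the part each filesystem would have to cover with its own locking before
it could set such a flag.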
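And for sync_old_buffers, "per filesystem" could look roughly like the loop
below. This is a rough sketch modelled on the 2.4 sync_supers() list walk,
not code from the tree: sync_supers_locked() is a made-up name, and the
sketch ignores the question of what keeps the super_blocks list stable once
the BKL is dropped between iterations, which is exactly the hard part of
such a change.

/* Hypothetical variant of sync_supers(): take the BKL once per dirty
 * super block instead of across the whole pass, so one slow
 * write_super() no longer holds kernel_flag for tens of milliseconds. */
static void sync_supers_locked(kdev_t dev)
{
        struct super_block * sb;

        for (sb = sb_entry(super_blocks.next);
             sb != sb_entry(&super_blocks);
             sb = sb_entry(sb->s_list.next)) {
                if (!sb->s_dev)
                        continue;
                if (dev && sb->s_dev != dev)
                        continue;
                if (!sb->s_dirt)
                        continue;
                lock_kernel();          /* held per filesystem, not per pass */
                if (sb->s_op && sb->s_op->write_super)
                        sb->s_op->write_super(sb);
                unlock_kernel();
        }
}

sync_old_buffers() would then call sync_supers_locked(0) (and a similar
per-filesystem variant of sync_inodes()) without taking the big kernel lock
around the whole pass.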
Mike Kravetz wrote:
>
> I just posted an updated version of the multi-queue scheduler
> for the 2.4.0 kernel.  This version also contains support for
> realtime tasks.  The patch can be found at:
>
> http://lse.sourceforge.net/scheduling/
>
> Here are some very preliminary numbers from sched_test_yield
> (which was previously posted to this (lse-tech) list by Bill
> Hartner).  Tests were run on a system with 8 700 MHz Pentium
> III processors.
>
>                  microseconds/yield
> # threads    2.2.16-22        2.4    2.4-multi-queue
> ---------    ---------   --------   ---------------
>    16           18.740      4.603             1.455
>    32           17.702      5.134             1.456
>    64           23.300      5.586             1.466
>   128           47.273     18.812             1.480
>   256          105.701     71.147             1.517
>   512              FRC    143.500             1.661
>  1024              FRC    196.425             6.166
>  2048              FRC        FRC            23.291
>  4096              FRC        FRC            47.117
>
> *FRC = failed to reach confidence level
>
> --
> Mike Kravetz                                 mkr...@se...
> IBM Linux Technology Center
> 15450 SW Koll Parkway
> Beaverton, OR 97006-6063                     (503)578-3494
>
> _______________________________________________
> Lse-tech mailing list
> Lse...@li...
> http://lists.sourceforge.net/lists/listinfo/lse-tech

--
Jun U Nakajima
Core OS Development
SCO/Murray Hill, NJ
Email: ju...@sc..., Phone: 908-790-2352  Fax: 908-790-2426