From: Hubertus F. <fr...@us...> - 2001-02-16 19:20:07
John, that's great insight. Since we are currently using the Chatroom purely as a scheduler benchmark, we might think about forking instead of cloning; from a scheduling point of view we would incur "merely" the page-table switch overhead. On second thought, doing so might actually create other problems with TLB invalidates, which I believe rely on IPIs (does anybody have a quick answer here?). We are also experimenting with another benchmark, <reflex>, from Shailabh, which of course would suffer from the same problem if pipe lookups go through a similar path.

Also, in defense of Bill Hartner: we were the ones pressing him to release Chatroom, because we were using it for our benchmarking.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us...
(w) 914-945-2003  (fax) 914-945-4425  TL: 862-2003

"John Hawkes" <ha...@en...>@lists.sourceforge.net on 02/16/2001 02:00:49 PM
Sent by: lse...@li...
To: "Lse-tech" <lse...@li...>
cc:
Subject: [Lse-tech] "chat" benchmark scalability limited by clone sharing of files_struct

I have used the IBM "chat" benchmark (v1.0.1) to examine scaling on a 32-CPU mips64 (SGI Origin 2000) NUMA system, running 2.4.1-derived kernels with and without the MultiQueue patch (the Jan 29 version), with and without kernprof profiling, and with and without lockmetering.

Finding #1: For 8, 16, 24, and 32 CPUs there appears to be no *obvious* performance difference between a base 2.4.1 kernel and a 2.4.1+MQ kernel, using "chat" rooms=10, messages=200. At each configuration I ran the test five times, computing a mean of the messages/second throughput result to compare configurations. Individual back-to-back test runs produced results that often varied by 10-15%, which was large enough to make me uncomfortable about using the mean of five results. "No obvious performance difference" means I observed that the MQ kernel was sometimes faster by a few percent and sometimes *slower* by a few percent, and the differences between the means (i.e., comparing 2.4.1 vs. 2.4.1+MQ for a given CPU count) were typically smaller than the variability of the five test runs for a specific configuration.

Finding #2: The performance steadily *decreases* as we go from 8 to 32 CPUs; specifically, the 32-CPU configuration exhibits about *half* the performance of the 8-CPU configuration for this particular test load. I used kernprof profiling and lockmetering (and some kernel hacking) to diagnose why.

Finding #3: The "chat" program creates clone'd threads -- 20 "users" for each of the 10 rooms, plus a "send" thread and a "receive" thread for each user -- and these threads perversely ping-pong the cacheblock containing the shared files_struct.file_lock (include/linux/sched.h). Specifically, sockfd_lookup() (net/socket.c) calls fget() (fs/file_table.c), and fget() does:

    struct file * fget(unsigned int fd)
    {
        struct file * file;
        struct files_struct *files = current->files;

        read_lock(&files->file_lock);
        file = fcheck(fd);
        if (file)
            get_file(file);
        read_unlock(&files->file_lock);
        return file;
    }

Because clone'd threads share the identical files_struct, we have hundreds of threads doing the read_lock() on the same (rwlock_t) file_lock. This does not cause overt lock contention (because with this test load there is no writer-owner of the file_lock), but it does mean that each read_lock() and each read_unlock() dirties the file_lock word, which produces the cacheblock ping-pong effect when another thread on another CPU accesses that same file_lock word.

Experiment: if sockfd_lookup() is hacked to call a new and different procedure, fget_nowait(), that does not do the read_lock() and read_unlock(), then the "chat" clones execute without apparent error and performance is roughly *doubled*, and the 2.4.1 vs. 2.4.1+MQ difference becomes apparent. That is, if we eliminate the read_lock/unlock we find: (1) the 32-CPU performance is roughly twice that of kernels that perform the lock/unlock; and (2) 2.4.1+MQ exhibits about 25% greater performance than basic 2.4.1 for 32 CPUs, which demonstrates the essential benefit of the MultiQueue approach.
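For concreteness, here is a sketch of roughly what such an fget_nowait() might look like, modeled on the fget() above. This is a guess at the hack rather than the actual modification; as commentary (3) below concedes, dropping the lock is not safe in general, since the lookup can race with a close() in another clone:

    struct file * fget_nowait(unsigned int fd)
    {
        struct file * file;

        /* Same lookup as fget(), but without taking the shared
         * files->file_lock, so the rwlock word is never dirtied
         * and never ping-pongs between CPUs.  Unsafe in general:
         * a concurrent close() in another clone can free the fd
         * table entry out from under us. */
        file = fcheck(fd);
        if (file)
            get_file(file);
        return file;
    }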
Commentary:

(1) The "chat" benchmark uncovers a pathological downside to a system load that consists of hundreds of clone'd threads performing file-related activities (i.e., activities that result in heavy use of fget() or other routines that access the shared files_struct). I believe this pathological downside will be apparent on any shared-memory multiprocessor system -- it is not specific to the mips64 NUMA system.

(2) Before we get alarmed about the negative scaling of such cloned threads, let's discuss whether or not the "chat" benchmark load (massive numbers of cloned threads) is an accurate representation of a chatroom server. If it is not, then we ought to abandon "chat" as a generally interesting workload -- even though we may still use it as a workload that exhibits a specific kind of pathological clone'd-thread behavior.

(3) I am not claiming that my hacked sockfd_lookup() is a valid modification to the kernel. That is, I'm willing to believe that sockfd_lookup() does need to call fget() and to read-lock the files_struct file_lock.

(4) Even if we conclude that the "chat" benchmark isn't an ideal general test of the runqueue scheduler, it still provides a useful service: it reminds us that intense write access to shared memory locations is a killer of scalable performance. The clone sharing of files_struct isn't the only such culprit lurking in the kernel.

John Hawkes
ha...@en...

_______________________________________________
Lse-tech mailing list
Lse...@li...
http://lists.sourceforge.net/lists/listinfo/lse-tech