From: Hubertus F. <fr...@us...> - 2001-02-16 19:20:07
|
John, that's great insight. Since we are currently using the Chatroom purely as a scheduler benchmark, we might think about forking instead of cloning, from a scheduling point we incur "merely" the page-table switch overhead. On second thought, doing so might actually create other problems with TLB invalidates which I believe rely on IPIs (anybody has some quick answers here). We are also experimenting with another benchmark <reflex> from Shailabh, which of course would suffer the same problem if pipe lookups go through a similar path.

Also, in defense of Bill Hartner: it was us who were pressing him to release Chatroom, because we were using it for our benchmarking.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us...
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003

"John Hawkes" <ha...@en...>@lists.sourceforge.net on 02/16/2001 02:00:49 PM
Sent by: lse...@li...
To: "Lse-tech" <lse...@li...>
cc:
Subject: [Lse-tech] "chat" benchmark scalability limited by clone sharing of files_struct

I have used the IBM "chat" benchmark (v1.0.1) to examine scaling on a 32-CPU mips64 (SGI Origin 2000) NUMA system, running 2.4.1-derived kernels with and without the MultiQueue patch (the Jan 29 version), with and without kernprof profiling, with and without lockmetering.

Finding #1: For 8, 16, 24, and 32 CPUs there appears to be no *obvious* performance difference between a base 2.4.1 kernel and a 2.4.1+MQ kernel, using "chat" rooms=10, messages=200. At each configuration I ran the test five times, computing a mean of the messages/second throughput result to compare configurations. Individual back-to-back test runs produced results that often varied by 10-15%, which was large enough to make me uncomfortable about using the mean of five results.
"No obvious performance difference" means I observed that the MQ kernel was sometimes faster by a few percent and sometimes *slower* by a few percent, and the differences between the means (i.e., comparing 2.4.1 vs. 2.4.1+MQ for a given CPU count) were typically less than the variability of the five test runs for a specific configuration.

Finding #2: The performance steadily *decreases* as we go from 8 to 32 CPUs; specifically, the 32-CPU configuration exhibits about *half* the performance of the 8-CPU configuration for this particular test load. I used kernprof profiling and lockmetering (and some kernel hacking) to diagnose why I witnessed a performance decrease.

Finding #3: The "chat" program creates clone'd threads -- 20 "users" for each of the 10 rooms, and a "send" thread and a "receive" thread for each user -- and these threads perversely ping-pong the cacheblock containing the shared files_struct.file_lock (include/linux/sched.h). Specifically, sockfd_lookup() (net/socket.c) calls fget() (fs/file_table.c), and fget() does:

    struct file * fget(unsigned int fd)
    {
        struct file * file;
        struct files_struct *files = current->files;

        read_lock(&files->file_lock);
        file = fcheck(fd);
        if (file)
            get_file(file);
        read_unlock(&files->file_lock);
        return file;
    }

Because clone'd threads share the identical files_struct, we have hundreds of threads doing the read_lock() on the same (rwlock_t) file_lock. This does not cause overt lock contention (because with this test load there is no writer-owner of the file_lock), but it does mean that each read_lock() and each read_unlock() dirties the file_lock word, which produces that cacheblock ping-pong effect when another thread on another CPU accesses that same file_lock word.

Experiment: if sockfd_lookup() is hacked to call a new and different procedure, fget_nowait(), that does not do the read_lock() and read_unlock(), then the "chat" clones execute without apparent error and performance is roughly *doubled*, and the 2.4.1 vs. 2.4.1+MQ difference becomes apparent. That is, if we eliminate the read_lock/unlock we find: (1) the 32-CPU performance is roughly twice the performance of kernels that perform the lock/unlock; and (2) 2.4.1+MQ exhibits about 25% greater performance than the basic 2.4.1 for 32 CPUs, which demonstrates the essential benefit of the MultiQueue approach.

Commentary:

(1) The "chat" benchmark uncovers a pathological downside to a system load that consists of hundreds of clone'd threads performing file-related activities (i.e., activities that result in heavy use of fget() or other routines that access the shared files_struct). I believe this pathological downside will be apparent on any shared-memory multiprocessor system -- it is not specific to the mips64 NUMA system.

(2) Before we get alarmed about the negative scaling of such cloned threads, let's discuss whether or not the "chat" benchmark load (massive numbers of cloned threads) is an accurate representation of a chatroom server. If it is not, then we ought to abandon "chat" as a generally interesting workload -- even though we may use "chat" as a workload that exhibits a specific kind of pathological clone'd-thread behavior.

(3) I am not claiming that my hacked sockfd_lookup() is a valid modification to the kernel. That is, I'm willing to believe that sockfd_lookup() does need to call fget() and to read-lock the files_struct file_lock.

(4) Even if we conclude that the "chat" benchmark isn't an ideal general test of the runqueue scheduler, it still provides a useful service by reminding us that intense write access to shared memory locations is a killer of scalable performance. The clone sharing of files_struct isn't the only culprit lurking in the kernel.

John Hawkes
ha...@en...

_______________________________________________
Lse-tech mailing list
Lse...@li...
http://lists.sourceforge.net/lists/listinfo/lse-tech |
From: John W. <jw...@en...> - 2001-02-16 19:40:37
|
Anybody know how VolanoMark behaves, since this is what chat was modeled after? Anybody know how a real chatroom server behaves? Can we hack in a parameter for how many clones are created? Is it a matter of just running more rooms with fewer users?

jwright

On Fri, Feb 16, 2001 at 02:15:49PM -0500, Hubertus Franke wrote:
> John, that's great insight.
>
> Since we are currently using the Chatroom purely as a scheduler benchmark,
> we might think about forking instead of cloning, from a scheduling point
> we incur "merely" the page-table switch overhead. On second thought, doing
> so might actually create other problems with TLB invalidates which I
> believe rely on IPIs (anybody has some quick answers here).

On Fri, Feb 16, 2001 at 02:15:49PM -0500, John Hawkes wrote:
> Finding #3: The "chat" program creates clone'd threads -- 20 "users"
> for each of the 10 rooms, and a "send" thread and "receive" thread for
> each user -- and these threads perversely ping-pong the cacheblock
> containing the shared files_struct.file_lock (include/linux/sched.h).

--
John Wright - SGI | Email: jw...@en...
Scalable Linux Manager | Voice: (650) 933-8899
1200 Crittenden Lane MS:30-3-500 | Pager: (650) 254-9296
Mountain View, CA 94043 | Alpha page: jwr...@pa... |
From: Andi K. <ak...@su...> - 2001-02-16 19:43:33
|
On Fri, Feb 16, 2001 at 02:15:49PM -0500, Hubertus Franke wrote:
> Since we are currently using the Chatroom purely as a scheduler benchmark,
> we might think about forking instead of cloning, from a scheduling point
> we incur "merely" the page-table switch overhead. On second thought, doing
> so might actually create other problems with TLB invalidates which I
> believe rely on IPIs (anybody has some quick answers here).

Linux only does a TLB IPI when a memory space is shared by multiple threads. For a single-threaded program with only a single mm user it only flushes on the local CPU, because the MM is guaranteed to be local. A forked chat should run with no TLB IPIs.

-Andi |
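[Editor's note] Andi's rule can be captured in a toy decision function. This is a sketch with illustrative field names (the real kernel tracks which CPUs an mm is live on, not just a user count); it only encodes the distinction he describes: a single-user mm can be live on at most the local CPU, so no cross-CPU invalidation is needed.

```c
#include <assert.h>

/* Illustrative model only -- not the actual kernel structures. */
struct mm_model {
    int mm_users;        /* threads currently sharing this address space */
};

enum flush_kind {
    FLUSH_LOCAL,         /* invalidate TLB on this CPU only */
    FLUSH_IPI            /* must IPI other CPUs that may cache stale entries */
};

/* A single-user mm is only ever loaded on the CPU its one thread runs on,
 * so a local flush suffices. A shared mm may be active on other CPUs, whose
 * TLBs would otherwise retain stale translations. */
static enum flush_kind tlb_flush_kind(const struct mm_model *mm)
{
    return mm->mm_users > 1 ? FLUSH_IPI : FLUSH_LOCAL;
}
```

Under this model, each forked (single-threaded) chat process takes the FLUSH_LOCAL path, which is why a forked chat should generate no TLB IPIs.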
From: Kanoj S. <ka...@go...> - 2001-02-16 19:58:33
|
> Since we are currently using the Chatroom purely as a scheduler benchmark,

It might be worthwhile checking whether "chat" is a purely synthetic benchmark. In real life, programs (which care about performance) might not invoke kernel calls that result in fget()/fput() so rapidly without some substantial work in between, which lowers the ping-pong rate of the cache lines.

> we might think about forking instead of cloning, from a scheduling point
> we incur "merely" the page-table switch overhead. On second thought, doing
> so might actually create other problems with TLB invalidates which I
> believe rely on IPIs (anybody has some quick answers here).

If anything, you will be lowering IPIs if you switch to a fork model. Your chat program might be less amenable to that, though, depending on its sharing model (in this case you want it to not share fds; you can still share other things until you run into issues).

Kanoj |
From: John H. <ha...@en...> - 2001-02-16 20:07:22
|
From: "Hubertus Franke" <fr...@us...>
> Since we are currently using the Chatroom purely as a scheduler benchmark,
> we might think about forking instead of cloning, from a scheduling point
> we incur "merely" the page-table switch overhead. On second thought, doing
> so might actually create other problems with TLB invalidates which I
> believe rely on IPIs (anybody has some quick answers here).

A forked thread also uses much more virtual memory than a cloned thread, so we won't be able to generate as many threads. I find it useful to have a high thread population when testing runqueue scheduling on a large SMP.

Cloned threads wouldn't be as much of a problem if they didn't hammer away on the files_struct. That is, a benign cloned thread would be one that simply chews up user cycles and occasionally yields the CPU. (sched_yield might cause its own perverse kernel cacheblock contention, though.) Such a benign thread would be interesting when looking at scheduler behavior, but it certainly isn't likely to be representative of a typical cloned thread.

John Hawkes
ha...@en... |
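[Editor's note] The "benign" cloned thread John describes might look like the sketch below. It uses POSIX threads rather than raw clone() for portability, and the iteration counts are arbitrary placeholders: the thread burns user cycles, periodically calls sched_yield(), and never touches the shared fd table, so it contributes scheduler load without any file_lock cacheline traffic.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* A "benign" worker: chews user cycles, occasionally yields, and performs
 * no file-related syscalls, so it never touches files_struct.file_lock. */
static void *benign_worker(void *arg)
{
    long iters = *(long *)arg;
    volatile unsigned long sink = 0;   /* volatile: keep the busy loop alive */

    for (long i = 0; i < iters; i++) {
        for (int j = 0; j < 1000; j++) /* chew user cycles */
            sink += (unsigned long)j;
        sched_yield();                 /* occasionally yield the CPU */
    }
    (void)sink;
    return NULL;
}
```

As John notes, even this pattern is not free: on a large SMP, many threads calling sched_yield() may themselves contend on scheduler data structures.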
From: Kanoj S. <ka...@go...> - 2001-02-16 21:06:41
|
> From: "Hubertus Franke" <fr...@us...>
> > Since we are currently using the Chatroom purely as a scheduler benchmark,
> > we might think about forking instead of cloning, from a scheduling point
> > we incur "merely" the page-table switch overhead. On second thought, doing
> > so might actually create other problems with TLB invalidates which I
> > believe rely on IPIs (anybody has some quick answers here).
>
> A forked thread also uses much more virtual memory than a cloned thread,
> so we won't be able to generate as many threads. I find it useful to

See my previous mail. You can still use CLONE_VM, and just stop using CLONE_FILES (if that is possible).

Kanoj |
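[Editor's note] Kanoj's suggestion can be sketched with the Linux clone(2) call: keep CLONE_VM (shared address space, so the thread population stays cheap) while omitting CLONE_FILES, so each "thread" gets a private copy of files_struct and therefore a private file_lock. This is a Linux-specific illustration; the demo() helper and its success convention are inventions for this sketch. The test below verifies the separation: the child closes its copy of a descriptor, and the parent's descriptor survives.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SZ (64 * 1024)

/* Child shares the parent's memory but has its OWN fd table, so this
 * close() cannot affect the parent's descriptor. */
static int child_fn(void *arg)
{
    close(*(int *)arg);
    return 0;
}

/* Returns 0 if the parent's fd survived the child's close(), i.e. the
 * fd tables really were separate. */
static int demo(void)
{
    int fd = open("/dev/null", O_RDONLY);
    if (fd < 0)
        return -1;

    char *stack = malloc(STACK_SZ);
    if (!stack)
        return -1;

    /* CLONE_VM shares the address space; omitting CLONE_FILES keeps the
     * fd tables private, avoiding the shared file_lock entirely. */
    int pid = clone(child_fn, stack + STACK_SZ, CLONE_VM | SIGCHLD, &fd);
    if (pid < 0)
        return -1;
    waitpid(pid, NULL, 0);

    int ok = (fcntl(fd, F_GETFD) != -1) ? 0 : -1;
    close(fd);
    free(stack);
    return ok;
}
```

Whether "chat" itself can adopt this split depends, as Kanoj says, on its sharing model: the sockets must then be distributed to the workers some other way (e.g. each worker opening its own).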
From: John H. <ha...@en...> - 2001-02-16 20:22:58
|
From: "Hubertus Franke" <fr...@us...>
> Also in defense of Bill Hartner. It was us who were pressing him to release
> Chatroom, because we were using it for our benchmarking.

In no way am I criticizing the "chat" benchmark. I find it to be a very useful workload -- it exhibits a true kernel shortcoming that results in negative scaling. The issues are: (1) "chat" isn't very useful for examining scheduler algorithms in large CPU-count configurations, since the fget() ping-ponging outweighs the effects of interesting scheduler changes, and (2) "chat" is significantly interesting when analyzing general scaling issues only if it is representative of real-world system loads (i.e., hundreds of cloned threads doing rapid-fire file or socket accesses).

Comparing "chat" to VolanoMark is interesting (I think) only if VolanoMark is itself representative of chatroom server software (in terms of how the threads get created and what the threads do). That is, let's not spend too much time chasing problems that appear only when the system workload is "not real-world" -- and that classification is certainly subject to debate.

John Hawkes
ha...@en... |
From: Mike K. <mkr...@se...> - 2001-02-17 00:10:12
|
My opinion is that we really need two types of benchmarks for the scheduler:

1) We need a benchmark that simply measures/shows scheduler overhead -- something along the lines of the spinning sched_yield() benchmarks. Unfortunately, spinning on sched_yield() only exercises one portion of the scheduler. We really need something that exercises the 'wakeup' path. It is my hope that Shailabh's reflex benchmark (which does token passing via pipes) can be used for this purpose.

2) We also need benchmarks that simulate real workloads and are heavily impacted by scheduler changes. 'chat' is one such benchmark that falls into this category. It is my intention to find other benchmarks (and keep a list of these benchmarks) which do the same.

--
Mike Kravetz
mkr...@se...
IBM Linux Technology Center |
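[Editor's note] The token-passing pattern Mike attributes to the reflex benchmark might be sketched as follows (the details of reflex itself are not given in the thread, so this is an assumed minimal form): a parent and child bounce a one-byte token through a pair of pipes, so every hop blocks one process and wakes the other, exercising the scheduler's wakeup path on each round trip.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Bounce a one-byte token parent->child->parent for `rounds` round trips.
 * Returns 0 on success, -1 on any failure. Each read() blocks until the
 * peer's write() wakes the reader -- the wakeup path under test. */
static int token_pass(int rounds)
{
    int p2c[2], c2p[2];          /* parent-to-child and child-to-parent pipes */

    if (pipe(p2c) == -1 || pipe(c2p) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {              /* child: echo the token back */
        char t;
        for (int i = 0; i < rounds; i++) {
            if (read(p2c[0], &t, 1) != 1) _exit(1);
            if (write(c2p[1], &t, 1) != 1) _exit(1);
        }
        _exit(0);
    }

    char tok = 'x';
    for (int i = 0; i < rounds; i++) {   /* parent: send, wait for the echo */
        if (write(p2c[1], &tok, 1) != 1) return -1;
        if (read(c2p[0], &tok, 1) != 1) return -1;
    }

    int status;
    waitpid(pid, &status, 0);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Timing many such round trips (and running several pairs concurrently) gives a direct measure of wakeup-path latency, the scheduler component Mike notes that sched_yield() spinners never exercise.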