From: Hubertus F. <fr...@us...> - 2001-02-16 19:20:07
John, that's great insight. Since we are currently using the Chatroom purely as a scheduler benchmark, we might think about forking instead of cloning; from a scheduling point of view we would incur "merely" the page-table switch overhead. On second thought, doing so might actually create other problems with TLB invalidates, which I believe rely on IPIs (does anybody have a quick answer here?). We are also experimenting with another benchmark, <reflex>, from Shailabh, which of course would suffer from the same problem if pipe lookups go through a similar path.

Also, in defense of Bill Hartner: we were the ones pressing him to release Chatroom, because we were using it for our benchmarking.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us...
(w) 914-945-2003  (fax) 914-945-4425  TL: 862-2003

"John Hawkes" <ha...@en...>@lists.sourceforge.net on 02/16/2001 02:00:49 PM
Sent by: lse...@li...
To: "Lse-tech" <lse...@li...>
cc:
Subject: [Lse-tech] "chat" benchmark scalability limited by clone sharing of files_struct

I have used the IBM "chat" benchmark (v1.0.1) to examine scaling on a 32-CPU mips64 (SGI Origin 2000) NUMA system, running 2.4.1-derived kernels with and without the MultiQueue patch (the Jan 29 version), with and without kernprof profiling, and with and without lockmetering.

Finding #1: For 8, 16, 24, and 32 CPUs there appears to be no *obvious* performance difference between a base 2.4.1 kernel and a 2.4.1+MQ kernel, using "chat" rooms=10, messages=200. At each configuration I ran the test five times, computing a mean of the messages/second throughput result to compare configurations. Individual back-to-back test runs produced results that often varied by 10-15%, which was large enough to make me uncomfortable about using the mean of five results. "No obvious performance difference" means I observed that the MQ kernel was sometimes faster by a few percent and sometimes *slower* by a few percent, and the differences between the means (i.e., comparing 2.4.1 vs. 2.4.1+MQ for a given CPU count) were typically smaller than the variability of the five test runs for a specific configuration.

Finding #2: The performance steadily *decreases* as we go from 8 to 32 CPUs; specifically, the 32-CPU configuration exhibits about *half* the performance of the 8-CPU configuration for this particular test load. I used kernprof profiling and lockmetering (and some kernel hacking) to diagnose why.

Finding #3: The "chat" program creates clone'd threads -- 20 "users" for each of the 10 rooms, plus a "send" thread and a "receive" thread for each user -- and these threads perversely ping-pong the cacheblock containing the shared files_struct.file_lock (include/linux/sched.h). Specifically, sockfd_lookup() (net/socket.c) calls fget() (fs/file_table.c), and fget() does:

    struct file * fget(unsigned int fd)
    {
        struct file * file;
        struct files_struct *files = current->files;

        read_lock(&files->file_lock);
        file = fcheck(fd);
        if (file)
            get_file(file);
        read_unlock(&files->file_lock);
        return file;
    }

Because clone'd threads share the identical files_struct, we have hundreds of threads doing the read_lock() on the same (rwlock_t) file_lock. This does not cause overt lock contention (because with this test load there is no writer-owner of the file_lock), but it does mean that each read_lock() and each read_unlock() dirties the file_lock word, which produces the cacheblock ping-pong effect when another thread on another CPU accesses that same file_lock word.

Experiment: if sockfd_lookup() is hacked to call a new and different procedure, fget_nowait(), that does not do the read_lock() and read_unlock(), then the "chat" clones execute without apparent error and performance is roughly *doubled*, and the 2.4.1 vs. 2.4.1+MQ difference becomes apparent. That is, if we eliminate the read_lock/unlock we find: (1) the 32-CPU performance is roughly twice that of kernels that perform the lock/unlock; and (2) 2.4.1+MQ exhibits about 25% greater performance than basic 2.4.1 for 32 CPUs, which demonstrates the essential benefit of the MultiQueue approach.
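For concreteness, here is a sketch of roughly what such an fget_nowait() might look like, modeled on the fget() above. This is a guess at the hack rather than the actual modification; as commentary (3) below concedes, dropping the lock is not safe in general, since the lookup can race with a close() in another clone:

    struct file * fget_nowait(unsigned int fd)
    {
        struct file * file;

        /* Same lookup as fget(), but without taking the shared
         * files->file_lock, so the rwlock word is never dirtied
         * and never ping-pongs between CPUs.  Unsafe in general:
         * a concurrent close() in another clone can free the fd
         * table entry out from under us. */
        file = fcheck(fd);
        if (file)
            get_file(file);
        return file;
    }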
Commentary:

(1) The "chat" benchmark uncovers a pathological downside to a system load that consists of hundreds of clone'd threads performing file-related activities (i.e., activities that result in heavy use of fget() or other routines that access the shared files_struct). I believe this pathological downside will be apparent on any shared-memory multiprocessor system -- it is not specific to the mips64 NUMA system.

(2) Before we get alarmed about the negative scaling of such cloned threads, let's discuss whether or not the "chat" benchmark load (massive numbers of cloned threads) is an accurate representation of a chatroom server. If it is not, then we ought to abandon "chat" as a generally interesting workload -- even though we may still use it as a workload that exhibits a specific kind of pathological clone'd-thread behavior.

(3) I am not claiming that my hacked sockfd_lookup() is a valid modification to the kernel. That is, I'm willing to believe that sockfd_lookup() does need to call fget() and to read-lock the files_struct file_lock.

(4) Even if we conclude that the "chat" benchmark isn't an ideal general test of the runqueue scheduler, it still provides a useful service: it reminds us that intense write access to shared memory locations is a killer of scalable performance. The clone sharing of files_struct isn't the only such culprit lurking in the kernel.

John Hawkes
ha...@en...

_______________________________________________
Lse-tech mailing list
Lse...@li...
http://lists.sourceforge.net/lists/listinfo/lse-tech