From: Dave O. <dm...@os...> - 2002-08-06 21:12:13
This is a summary of where I think we are with the DBT benchmark work on an 8-way system. I'd appreciate any suggestions on other approaches, or on ways to get interesting results more quickly.

Our goal isn't necessarily to produce high DBT numbers (large numbers of DBT transactions per second, for example). Instead, it's to construct an environment where the kernel can be studied doing "interesting" things. Things of interest would include system panics, or kernel algorithms that aren't scaling well.

I'm looking at some of the results we've been getting on an 8-processor system running the SAP database. So far, the results are not as interesting as we'd like, mostly because the database we're running is too small. I think SAP is basically loading the database into its cache. After that, the system runs pretty much entirely in user mode, with less than 5% kernel time. We haven't profiled this environment yet. That might be interesting, but I don't THINK it's the priority right now.

We're trying to maintain a stable environment while we learn to grow the benchmark, so in the short term we're running 2.4.18. We've learned about some limitations in the client simulation software that cap it at a maximum of 800 users. Eventually, we probably need to go back and fix that software, but in the meantime we're adding more client machines to simulate more users. There are other configuration issues, such as the number of database connections, that we're learning to adjust.

Once we reach a number of users that keeps pretty much all of our CPU time busy, we intend to grow the database. I would expect this to exercise the kernel more, doing I/O and paging; that is when results should get more interesting. At that point, I think the 2.4 kernels are less interesting than 2.5. I'd like us to begin doing benchmark runs with the latest 2.5.30 kernel, and then start adding patches of interest to lse-tech or others on lkml. We could initially produce sar statistics.
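To make the sar idea concrete, here is a rough sketch of the kind of wrapper I have in mind for one run, assuming the sysstat tools are installed. The "run_benchmark" function is purely a placeholder for whatever script actually drives the DBT client load; the paths and sample counts are illustrative, not what we actually use:

```shell
#!/bin/sh
# Hypothetical wrapper for one benchmark run, assuming sysstat's sar is
# installed.  "run_benchmark" stands in for the real DBT/SAP driver; here
# it is just a placeholder sleep.

run_benchmark() { sleep 1; }   # replace with the real client driver

OUT=/var/tmp/dbt-run-$(date +%Y%m%d-%H%M)
mkdir -p "$OUT"

# Sample system activity every 30 seconds, saving binary records that can
# be post-processed after the run (480 samples covers about 4 hours).
sar -o "$OUT/sa.data" 30 480 >/dev/null 2>&1 &
SAR_PID=$!

run_benchmark

kill $SAR_PID 2>/dev/null

# The same data file yields several views without redoing the half-day run:
sar -u -f "$OUT/sa.data" > "$OUT/cpu.txt"    # CPU utilization
sar -b -f "$OUT/sa.data" > "$OUT/io.txt"     # I/O and transfer rates
```

The point of saving the binary file (-o) rather than just the text report is that we can go back and pull different statistics out of an old run instead of repeating it.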
But I think doing kernel profile and lockmeter runs on these would be valuable as well. I know there is a cost to doing kernel profiling and lockmeter data collection. Does it make sense to COMBINE lockmeter and profiling in one kernel? I suspect it doesn't. Likewise, how much does profiling or lockmeter data collection corrupt the sar data? Since it takes about half a day to do one benchmark run, it would be good if we could run a profiled kernel right away and still be able to trust the sar data.

There's also a kernel profiling patch from SGI that does more than the built-in profiling: it recompiles the kernel to do mcount data collection, and so on. This is more costly than the native profiling. How much value is there in this extra data?

We also have requests in to the SAP mailing list asking for more information about how SAP works. It's obviously caching database information. We'd like to know how it manages its cache, and what limitations there are on its caching implementation. We're also looking into how it stripes its I/O.

Let me know what you think!

Dave
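P.S. For reference, here is a rough sketch of the built-in profiling flow mentioned above, assuming the kernel was booted with "profile=2" on the command line and that readprofile(8) and a matching System.map are available. As before, "run_benchmark" is only a placeholder for the real DBT driver:

```shell
#!/bin/sh
# Sketch of one profiled run using the kernel's built-in tick profiler.
# Assumes boot parameter "profile=2" and a System.map matching the
# running kernel; adjust the path for your setup.

run_benchmark() { sleep 1; }   # placeholder for the real client driver

readprofile -r                 # zero the profiling counters before the run
run_benchmark
# Top 20 kernel hot spots by tick count, for comparison across patches:
readprofile -m /boot/System.map | sort -nr | head -20 > profile.txt
```

Resetting the counters just before the run (-r) keeps boot-time activity from polluting the numbers, which matters if we want to compare hot-spot lists between 2.5.30 and patched kernels.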