[lc-devel] Contest Behaviour with Compressed Cache
From: Rodrigo S. de C. <rc...@im...> - 2002-09-28 18:54:16
[Con and Rik, it's a long email, but I'd like you to know some of my conclusions about contest and mem_load. Comments are welcome.]

Contest [1] is a new benchmark by Con Kolivas that aims to test system responsiveness. It was announced on the linux kernel mailing list and has since attracted much attention from the kernel community.

Two weeks ago, Con kindly sent me some results from running his benchmark on a kernel with compressed cache. The first results were very good, so I published them right away with the other statistics. However, after some bug fixes in contest, we noticed that compressed cache wasn't improving system performance under the memory load test. As a matter of fact, performance was worse than on a vanilla kernel. Under IO load it improves, and for the other loads it makes no difference whether compressed cache is used or not.

First, let me briefly describe how contest works. It runs a kernel compilation (with -j4 concurrency level) under different load conditions. For example, memory load, which is the load I focused on, uses 110% of the system memory, allocating, touching and moving memory. There are other loads, like process load and IO load, and contest also benchmarks the compilation on a system without any load.

Given that the idea behind contest is very interesting, I think it would be nice to have compressed cache improving system performance when running this benchmark under the memory load condition. Even if that turns out not to be possible, we should at least understand what is going on. Therefore, I focused on this problem and I think I've come to interesting conclusions.

Running 2.4.18 vanilla and 2.4.18-0.24pre5 (not released yet), the results I got were:

2.4.18:    95.03s completion time, 76% CPU usage
2.4.18-cc: 99.84s completion time, 72% CPU usage

At first I thought our problem was the high number of compressions and decompressions, but in that case we would see a higher CPU usage, not a lower one.
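For anyone who hasn't looked at contest, the mem_load pattern (allocate more than RAM, then keep touching and moving it) can be sketched roughly like this. This is a toy Python sketch under my own assumptions, not contest's actual C code; the function name, the page-sized stride, and the small sizes in the example are mine (contest itself sizes the buffer at ~110% of physical RAM):

```python
def mem_load_sketch(total_bytes, passes=1, page=4096):
    """Toy version of a mem_load-style loop: allocate a buffer and keep
    touching and moving memory so the kernel is forced to evict pages.
    Returns the number of page-sized chunks touched."""
    buf = bytearray(total_bytes)
    touched = 0
    for _ in range(passes):
        # touch one byte per page-sized chunk so every page is made resident
        for off in range(0, total_bytes, page):
            buf[off] = (buf[off] + 1) & 0xFF
            touched += 1
        # "move" memory: copy the second half of the buffer over the first
        half = total_bytes // 2
        buf[:half] = buf[half:half * 2]
    return touched

# touching a 64 KiB buffer once visits 16 pages
print(mem_load_sketch(64 * 1024))  # -> 16
```

When total_bytes exceeds physical RAM, every pass over the buffer generates page faults, which on vanilla turn into the swapins/swapouts discussed below.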
Checking /proc/stat outputs, I could verify that compressed cache significantly reduces the amount of IO performed by the system, which should make the kernel compilation run by contest complete faster. From some profiling data, I also noticed that compressed cache reduces the time the CPU spends in the idle state very significantly: from 20.14 to 1.62 seconds.

The reason we get a worse completion time is that a kernel with compressed cache may have, and probably does have, different scheduling behaviour compared to a vanilla kernel. That's because we reduce (very much so, depending on the case) the IO performed by the system. IO, notably that done to service a page fault, forces the current process to relinquish the CPU while the scheduler runs another task. Given that we reduce the total IO, we get much less of this compulsory scheduling due to page faults, for example. We also spend some system time compressing and decompressing memory pages, which is added to the current task's system time (another reason for slightly different scheduling), but that ends up being less time than the IO operations would take.

Running contest with mem_load, a kernel with compressed cache doesn't perform any swap at all, very far from the over 60 thousand swapins and over 70 thousand swapouts performed by vanilla. Comparing mem_load and the kernel compilation: the former has all of its IO saved (its IO consists only of swapin/swapout operations), so it doesn't relinquish the CPU to perform IO as it does on a vanilla kernel. The kernel compilation, on the other hand, still has to perform some operations that cannot be saved (like writing .o files or reading source files), and even though it relinquishes the CPU less often than on vanilla, because we compress pages from the page cache, it still does so more often than mem_load. In short, on vanilla, mem_load is scheduled out much more often due to IO (think of the swapins/swapouts mentioned above), giving the kernel compilation more of the CPU than it gets on a kernel with compressed cache.
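The idle-time figures above come from /proc/stat style accounting, where the aggregate "cpu" line counts user, nice, system and idle time in jiffies. As a rough illustration of how such a number turns into seconds (a sketch assuming the usual USER_HZ=100 tick rate of 2.4-era kernels; the helper function is mine, not part of contest):

```python
def idle_seconds(stat_cpu_line, user_hz=100):
    """Parse the aggregate 'cpu' line from /proc/stat and return the
    idle time in seconds.  The fields after the 'cpu' label are
    user, nice, system, idle, counted in jiffies (USER_HZ ticks,
    typically 100 per second on 2.4 kernels)."""
    fields = stat_cpu_line.split()
    assert fields[0] == "cpu"
    idle_jiffies = int(fields[4])  # idle is the 4th value after the label
    return idle_jiffies / user_hz

# an idle field of 2014 jiffies corresponds to the 20.14 s measured on vanilla
print(idle_seconds("cpu 7603 0 1194 2014"))  # -> 20.14
```

Comparing two snapshots of this line, taken before and after a contest run, gives the idle-time delta for that run.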
With compressed cache, mem_load uses most of its CPU time slice because it doesn't have to perform IO, so the kernel compilation gets less of the CPU (that's why its CPU usage is smaller) and takes a little longer to finish. In spite of the worse compilation performance, the system, generally speaking, runs smoother. The mem_load loop runs through many more iterations than on vanilla; if you run mem_load with its debug printf()s, the difference is quite striking.

Under the memory load condition, contest only measures the time spent compiling the kernel; it doesn't take into account how the background processes are affected by the compilation. With compressed cache, this particular background process (mem_load) and the overall system perform better. Note that other background processes might show different results with compressed cache, depending on what they do. I don't think the current contest, in the memory load case, is suitable for benchmarking a system with compressed cache: it doesn't measure the improvement to the whole system, only to one process, and that measurement may be influenced by scheduling issues.

[1] http://contest.kolivas.net

Regards,

--
Rodrigo