> On Friday 28 July 2006 16:50, Tom Schutter wrote:
>>I am about to get a new machine. Seeing how valgrind punishes my
>>machine more than any other app, I am letting valgrind control the
>>specs of the new box. In my mind that means more RAM is the primary
>>consideration, fast CPU is second, and everything else comes after.
>>Dual-core or dual CPU won't help. Fast disk is nice, but not essential.
>>I am planning on an AMD Athlon 64 or Opteron CPU. The problem I am
>>having is finding the right motherboard that will support more than
>>4GB of RAM.
Julian Seward replied:
> Interesting proposition; however I have a slightly different
> suggestion. From what we've seen from profiling V, what really
> kills you is cache misses. So maybe consider getting something
> with a huge L2 cache. Core 2 Duo E6400 and above have 4M L2s
> and nice low power consumption.
Just to be sure: the first priority is no actual paging to/from disk.
Disk typically sustains only 80 to 200 page faults per second,
which is a delay of 37.5M to 15M CPU cycles per page fault @ 3GHz.
Check by running /usr/bin/top: the "Swap:" line should show "0k used",
and a Valgrind process should show a constant difference between
VIRT and RES.
Next priority is a low cache miss rate (avoid accesses to RAM).
The capacity for uncorrelated [non-adjacent] access to RAM
often is 4M to 10M accesses per second, which is a delay of
750 to 300 CPU cycles per cache miss @ 3GHz. Bigger L2 cache
(the 4MB that Julian mentioned) tends to be better.
Making effective use of a large L2 cache also matters. One of the
factors is misses in the TLB (Translation Look-aside Buffer) for
virtual addresses. The TLB is a cache, too. It has a few dozen
entries, and each entry can remember the virtual-to-physical
address translation for one page of either "small" (4KB) or
"huge" (4MB, except 2MB on some CPUs) size. Operating with huge
pages tends to reduce the delay due to TLB misses. A miss in the
TLB costs slightly more than either one or several accesses to the
data cache, depending on the underlying architecture (a direct-mapped
or reverse-mapped basis for the TLB).
Another factor in making effective use of a large L2 cache is
aliasing of addresses onto cache lines. The mapping from virtual
address to cache set must be evenly distributed for best throughput.
Using larger pages tends to help, but in most cases the operating
system still must choose physical pages carefully ["page coloring"]
in order to obtain best cache performance.
Unfortunately for the performance of large processes, by default
Linux uses small pages and random coloring. Sometimes you lose big,
by a factor of 2 or more in cache throughput, due to the luck of
the draw. You can try using hugetlbfs with an explicit mmap(), but
this requires cooperation with Valgrind, because as far as memcheck
is concerned the mmap() initializes the memory.
As a practical matter, reasonable systems with more than 3GB to 4GB
of RAM often are considered not to be "personal" machines. Instead,
the market deems them to be "servers" for a workgroup, department,
or enterprise. They are often dual-processor or quad-processor.
[Two vendor names include Tyan and Supermicro.] The cost of the
motherboard alone is often $300 to $1000, and each processor often
costs a similar amount. It often makes sense to use registered,
buffered, ECC RAM, which also costs more than the unregistered,
unbuffered, non-ECC RAM used in "personal" workstations.