From: Michał P. <mic...@gm...> - 2016-10-15 08:39:08
Hey, one more question about the tuning project already known to some of you. We are seeing excellent performance, and the findings on how to achieve it will be published soon. There is one more little thing that keeps me up at night, though ;)

The hardware: 2x E5-2697 v3 and 2x X710 for now, one per NUMA node. I use the isolcpus kernel command line switch, which effectively removes cores 1+ from the scheduler. IRQ affinity is also used to pin the processing of each card's data:

Core 0     - housekeeping
Core 1     - hardware IRQ + software IRQ + kernel side of af_packet processing
Cores 2..N - my workload

When I measured L3 cache hits with

  perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches command -C1

(and -C2, and so on) I got excellent results - 0.3% misses or so, on all cores. DCA really works, after lowering the ring size with ethtool to 512.

Then I started using RPS, to move some parts of the kernel workload to a separate core, after I saw Core 1 being frequently pegged:

Core 0  - housekeeping
Core 1  - hardware IRQ + parts of software IRQ, where RPS starts
Core 2  - rest of software IRQ, kernel side of af_packet processing
Core 3+ - my workload

L3 cache misses on cores 2, 3 and up are still very low, around 0.4% - but L3 misses on Core 1 suddenly went up a lot, to at least 8%. That kind of ruins my understanding of how DCA is supposed to work here - shouldn't it be putting the data into the L3 cache? Or am I misinterpreting the results, or measuring them wrong?
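
For reference, here is roughly how the isolation and pinning above were set up. The CPU list and the IRQ number are placeholders for this sketch; the real values depend on your topology and on what /proc/interrupts shows for the X710 queues:

  # kernel command line: leave only core 0 to the scheduler
  isolcpus=1-27

  # pin a NIC queue IRQ to core 1 (CPU mask 0x2);
  # look up the actual IRQ number in /proc/interrupts
  echo 2 > /proc/irq/123/smp_affinity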
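
The measurements were taken per core, along these lines ("sleep 10" is just a 10-second sampling window; any command works as the timer):

  # count LLC events on core 1 only, for 10 seconds
  perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches \
            -C 1 sleep 10

The miss percentages I quote are LLC-load-misses as a fraction of LLC-loads.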
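
Shrinking the RX descriptor ring, which is what made DCA effective here, is a one-liner; eth0 stands in for the X710's real interface name:

  # check the current ring sizes, then lower RX to 512 descriptors
  ethtool -g eth0
  ethtool -G eth0 rx 512

The intuition, as far as I understand it: a smaller ring means fewer in-flight RX buffers, so the DMA writes keep landing in cache instead of being evicted before the CPU touches them.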
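
And the RPS change that splits the kernel work between cores 1 and 2 is just a CPU mask written into the per-RX-queue sysfs file (interface and queue names are again examples):

  # let RPS steer packets from RX queue 0 to core 2 (CPU mask 0x4)
  echo 4 > /sys/class/net/eth0/queues/rx-0/rps_cpus

With that in place, core 1 still takes the hardware interrupt and the first part of the softirq work, and core 2 picks up the rest - which is exactly where the L3 miss numbers diverge.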