From: Michał P. <mic...@gm...> - 2016-10-15 08:39:08
Hey, one more question about the tuning project already known to some of you. We are seeing excellent performance, and the findings on how to achieve it will be published soon. There is one more little thing that keeps me up at night, though ;)

The hardware: 2x E5-2697 v3 and 2x X710 for now, one per NUMA node. I use the isolcpus kernel command line switch, which effectively removes cores 1+ from the scheduler. IRQ affinity is also used to pin the processing of each card's data:

Core 0     - housekeeping
Core 1     - hardware IRQ + software IRQ + kernel side of af_packet processing
Cores 2..N - my workload

When I measured L3 cache hits with

  perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches command -C1

(and -C2, and so on) I got excellent results - 0.3% misses or so, on all cores. DCA really works, after lowering the ring size with ethtool to 512.

Then I started using RPS, to move some parts of the kernel workload to a separate core, after I saw Core 1 being frequently pegged:

Core 0  - housekeeping
Core 1  - hardware IRQ + parts of software IRQ, where RPS starts
Core 2  - rest of software IRQ, kernel side of af_packet processing
Core 3+ - my workload

L3 cache misses on cores 2, 3 and up are still very low, around 0.4% - but L3 misses on Core 1 suddenly went up a lot, to at least 8%. That kind of ruins my understanding of how DCA is supposed to work here - shouldn't it be putting the data into the L3 cache? Or am I misinterpreting the results, or measuring them wrong?
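
For reference, here is roughly how the isolation and pinning above were set up. The CPU list and the IRQ number are placeholders for this sketch; the real values depend on your topology and on what /proc/interrupts shows for the X710 queues:

  # kernel command line: leave only core 0 to the scheduler
  isolcpus=1-27

  # pin a NIC queue IRQ to core 1 (CPU mask 0x2);
  # look up the actual IRQ number in /proc/interrupts
  echo 2 > /proc/irq/123/smp_affinity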
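
The measurements were taken per core, along these lines ("sleep 10" is just a 10-second sampling window; any command works as the timer):

  # count LLC events on core 1 only, for 10 seconds
  perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches \
            -C 1 sleep 10

The miss percentages I quote are LLC-load-misses as a fraction of LLC-loads.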
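
Shrinking the RX descriptor ring, which is what made DCA effective here, is a one-liner; eth0 stands in for the X710's real interface name:

  # check the current ring sizes, then lower RX to 512 descriptors
  ethtool -g eth0
  ethtool -G eth0 rx 512

The intuition, as far as I understand it: a smaller ring means fewer in-flight RX buffers, so the DMA writes keep landing in cache instead of being evicted before the CPU touches them.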
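
And the RPS change that splits the kernel work between cores 1 and 2 is just a CPU mask written into the per-RX-queue sysfs file (interface and queue names are again examples):

  # let RPS steer packets from RX queue 0 to core 2 (CPU mask 0x4)
  echo 4 > /sys/class/net/eth0/queues/rx-0/rps_cpus

With that in place, core 1 still takes the hardware interrupt and the first part of the softirq work, and core 2 picks up the rest - which is exactly where the L3 miss numbers diverge.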