From: Michał P. <mic...@gm...> - 2016-10-15 10:46:47
Ignore me ;) I described one thing but implemented something else. I had the
IRQ pinned to core 0, which was also doing housekeeping, and that is where the
L3 cache thrashing happened. With

Core 0 - housekeeping
Core 1 - IRQ
Core 2 - leftovers of softIRQ, kernel side of af_packet

I have 0.6% L3 misses on Core 1, where packets are delivered. Excellent.

On Sat, Oct 15, 2016 at 10:38 AM, Michał Purzyński <
mic...@gm...> wrote:

> Hey, one more question for the already known (to some of you) tuning
> project. We can see excellent performance, and the findings on how to
> achieve that will soon be published. There is one more little thing that
> keeps me up at night though ;)
>
> 2x E5-2697 v3, 2x X710 now, one per NUMA node.
>
> I use the isolcpus kernel command line switch, which effectively removes
> cores 1+ from the scheduler. IRQ affinity is also used to pin the
> processing of the card's data.
>
> Core 0 - housekeeping
> Core 1 - hardware + software IRQ + kernel side of af_packet processing
> Cores 2..N - my workload
>
> When I measured L3 cache hits with
>
> perf stat -C 1 -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches -- command
>
> (and -C 2, etc.) I got excellent results - about 0.3% misses or so, on all
> cores. DCA really works after lowering the ring size with ethtool - to 512.
>
> Then I started using RPS - to move some parts of the kernel workload to a
> separate core, because I saw Core 1 being frequently pegged:
>
> Core 0 - housekeeping
> Core 1 - hardware + parts of software IRQ, RPS starts
> Core 2 - rest of software IRQ, kernel side of af_packet processing
> Core 3+ - my workload
>
> L3 cache misses on cores 2, 3 and up are still very low, around 0.4% - but
> L3 misses on Core 1 suddenly went up a lot, to at least 8%.
>
> It kind of ruins my understanding of how DCA is supposed to work here -
> shouldn't it be putting the data in the L3 cache? Or am I misinterpreting
> the results, or measuring it wrong?
>
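
For reference, a minimal sketch of the core isolation and IRQ pinning described
above, assuming the i40e driver for the X710; the core range and the IRQ number
are placeholders, the real IRQ numbers come from /proc/interrupts:

    # Kernel command line: keep cores 1..27 away from the general scheduler
    # (2x E5-2697 v3 = 28 physical cores; adjust the range to the machine)
    isolcpus=1-27

    # Find the NIC queue IRQs (i40e is the X710 driver) and pin one to core 1
    # (CPU bitmask 0x2 = CPU 1); "123" is a placeholder IRQ number
    grep i40e /proc/interrupts
    echo 2 > /proc/irq/123/smp_affinity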
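
The per-core cache measurement can be reproduced along these lines; the core
number and duration are arbitrary, and LLC-prefetches is not exposed on every
CPU, so it is left out here:

    # Count LLC events system-wide, restricted to core 1, for 10 seconds
    perf stat -a -C 1 -e LLC-loads,LLC-load-misses,LLC-stores sleep 10

    # miss rate = LLC-load-misses / LLC-loads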
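
And a sketch of the ring size and RPS settings mentioned in the thread,
assuming an interface named eth2, queue rx-0, and RPS work steered to core 2
(bitmask 0x4); the interface and queue names are placeholders:

    # Shrink the RX ring to 512 descriptors, as described above for DCA
    ethtool -G eth2 rx 512

    # Steer RPS processing for queue 0 to core 2 (CPU bitmask 0x4)
    echo 4 > /sys/class/net/eth2/queues/rx-0/rps_cpus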