From: Waskiewicz J. P. P <pet...@in...> - 2013-03-25 08:41:05
On 3/25/2013 1:19 AM, Shinae Woo wrote:
> Thanks for the answer!
>
> We checked that PCIe, but our current environment does not cross PCIe as
> we use only 1 NUMA node. Although we also see that if we cross the PCIe
> (using remote NUMA nodes), the performance is slightly lower.

This can happen for a number of reasons. One is the added latency of crossing QPI. But the biggest cost is the memory fetch and the rescheduling of processes on the CPUs when memory isn't local to the running process.

> We will reduce the number of queues to support 10G in the current setup.
> But do you know what kinds of processing overhead exist for RSS? Or does
> just processing RSS itself for multiple queues have overhead?

RSS is pretty much free when it comes to the HW itself. Before the interrupt reaches the driver, the buffers have already been hashed and DMA'd into the proper queues, so the CPU isn't involved at all. The real issue is lining up caches: with RSS, the Tx and Rx sides of a workload may not end up on the same CPU core. Linux does not have the equivalent of what Windows has with TSS/RSS, where the RSS table is re-adjusted on the fly to follow a network flow to another queue. So if your test is transmitting on CPU core 0, RSS may return the responses on CPU core 3, causing cache thrashing and the overhead of process rescheduling.

> We ran another test in which a usleep() between packet reads reduces the
> packet drop ratio.
> For example, we receive up to 64 packets at one time.
> If we add small operations like usleep() or other time-consuming
> operations, the packet drops become almost zero.

This is probably because you're allowing the rescheduling and cache-thrashing time to get through the backlog of packets coming in off the wire. RSS is good for spreading workloads across many cores, but its placement is random at best. On 82599 and X540 hardware, using Flow Director filtering gives much better determinism over where flows land.
> I have spent almost a month desperately trying to find the answer, but my
> efforts were in vain.
> Do you think this problem can be solved at the software level, or is it
> unsolvable given the hardware characteristics?

I'd investigate Flow Director (also called ATR in the ixgbe driver) to see if it fits your needs. The fundamental issue is that 64-byte packets can only be processed so fast by current CPUs, so adding overhead such as random queue cache thrashing and process reschedules results in less-than-ideal performance. And those thrashes and reschedules happen often when using RSS in Linux. RSS and tiny packets won't usually yield great results, and when they do, the results usually aren't reliably reproducible.

Cheers,
-PJ