From: Waskiewicz J. P. P <pet...@in...> - 2013-03-25 08:41:05
On 3/25/2013 1:19 AM, Shinae Woo wrote:
> Thanks for the answer!
>
> We checked that PCIe, but our current environment does not cross PCIe as
> we use only 1 NUMA node. Although we also see that if we cross the PCIe
> (using remote NUMA nodes), the performance is slightly lower.

This can happen for a number of reasons. One is the added latency of crossing QPI. But the biggest cost is the memory fetch and the rescheduling of processes on the CPUs when memory isn't local to the running process.

> We will reduce the number of queues to support 10G in the current setup.
> But do you know what kinds of processing overhead exist for RSS? Or does
> just processing RSS itself for multiple queues have overhead?

RSS is pretty much free when it comes to the HW itself. Before the interrupt reaches the driver, the buffers have already been hashed and DMA'd into the proper queues, so the CPU isn't involved at all. The real issue is lining up caches: with RSS, the Tx and Rx sides of a workload may not end up on the same CPU core. Linux does not have the equivalent of what Windows has with TSS/RSS, where the RSS table is re-adjusted on the fly to follow a network flow to another queue. So if your test is transmitting on CPU core 0, RSS may return the responses on CPU core 3, causing cache thrashing and the overhead of process rescheduling.

> We ran another test in which a usleep() between packet reads reduces the
> packet drop ratio.
> For example, we receive up to 64 packets at one time.
> If we add small operations like usleep() or other time-consuming
> operations, the packet drops become almost zero.

This is probably because you're allowing the rescheduling and cache-thrashing time to get through the backlog of packets coming in off the wire. RSS is good for spreading workloads across many cores, but its placement is random at best. On 82599 and X540 hardware, using Flow Director filtering gives much better determinism over where flows land.
> I have spent almost a month desperately trying to find the answer, but my
> efforts were in vain.
> Do you think this problem can be solved at the software level, or is it
> unsolvable given the hardware characteristics?

I'd investigate Flow Director (also called ATR in the ixgbe driver) to see if it fits your needs. The fundamental issue is that 64-byte packets can only be processed so fast by current CPUs, so adding overhead such as random queue cache thrashing and process reschedules results in less-than-ideal performance. And those thrashes and reschedules happen often when using RSS in Linux. RSS and tiny packets won't usually yield great results, and when they do, the results usually aren't reliably reproducible.

Cheers,
-PJ