From: jamal <ha...@cy...> - 2000-12-25 00:16:42
I apologize for the long email. Blame Andrew Morton ;->

Since the OLS, Robert Olson and I have been looking at different
schemes to improve on what I presented. Summary: I reported < 80Kpps
at OLS for SMP. For those who missed the presentation, look at:
http://robur.slu.se/Linux/net-development/jamal/FF-html/

A few tweaks on the same code and the numbers have gone up to 110Kpps
on a uni-processor and 130Kpps on SMP. The main peering routers at
slu.se are PCs running these patches, peering via gated/BGP. Robert
has seen upwards of 200Kpps on GigE without any tweaks. We'll be
presenting more precise results at NORD-USENIX in February. We have
basically exceeded the promise made at OLS to do 100Mbps wire speed
(~148Kpps) by the end of the year.

In the meantime, we have each come up with slightly different schemes
that we hope will take us to the next level. We've also discovered
work done by Robert Morris at MIT on Click, which is yet a third
approach, and fascinating (except for the C++ part ;-<). Robert
Morris says he is able to do 333Kpps on a dual PCI SMP machine (it
would be nice to get access to a beast like that).

In my set of experiments (done a few months ago), I have been able to
get 140Kpps on a single processor and up to 190Kpps on SMP with IRQ
affinity. I believe these numbers can go higher with proper driver
tuning, but they will peak at some point, mostly because of the APIC.
Gerrit Huizenga and I had a conversation at the OLS in which he
pointed out that Linux blindly does round-robin on receive interrupts
and hands them to the next CPU on the list. I have been informed that
this is in fact a property of the APIC ;-> I believe I am being
bitten by this at the moment. I am hoping to test his patches at some
point (unfortunately, this Christmas break I have a more
important/exciting project I am working on).

On Sun, 24 Dec 2000, Andrew Morton wrote:

> Gerrit Huizenga wrote:
> >
> > Andi Kleen wrote:
> >
> > > You think it wouldn't help for database servers? (e.g. with a
> > > NIC and a SCSI controller per CPU)
> > >
> > > -Andi
> >
> > I think NIC-to-CPU binding would simply increase the latency
> > problem for interrupt delivery in this case. Allowing the APIC to
> > direct an interrupt to the first available CPU decreases the
> > average interrupt delivery latency. And I'd guess that the
> > interrupt latency more likely governs the throughput than the
> > sharing of a few cache lines (1/nprocessors of the time) on most
> > modern SMP systems.
> >
> > Of course, this depends a lot on how long a lock is held
> > (lock_irq()). I had heard that the number of instructions a lock
> > was generally held for in Linux was *very* small, although some
> > code I've looked at doesn't seem to bear that out (at least not
> > any more).

Sounds logical. I think being able to dynamically route IRQs under
some policy mechanism (such as the one available for IRQ affinity
right now) would help a great deal.

> I had a few wild thoughts on this topic earlier in the year. I
> haven't had a chance to do anything with them because people keep
> on putting bugs in the kernel :)
>
> * presume that interrupts are wickedly expensive and we want to
>   minimise them. This is more relevant to low-end (100mbit) NICs.
>
> * presume that cross-CPU traffic and cache misses are expensive,
>   and we want to optimise for these.

The OLS solution was interrupt mitigation: cache amortization is
achieved because you grab many packets per unit of time (and reduced
interrupts are a given as well). This does not reduce the cross-CPU
traffic, of course, but there was none introduced to start with.
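To make the mitigation idea concrete, here is a minimal sketch of the
Rx handler draining a batch of frames per interrupt instead of one.
This is only an illustration against 2.4-style driver interfaces:
rx_ring_next() and RX_BUDGET are hypothetical stand-ins for the
hardware-specific ring accessor and the per-driver tuning knob.

#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/interrupt.h>
#include <linux/skbuff.h>

#define RX_BUDGET 32    /* assumption: frames to drain per interrupt */

/* Hypothetical hardware-specific helper: pull the next received frame
 * off the Rx descriptor ring, or return NULL if the ring is empty. */
extern struct sk_buff *rx_ring_next(struct net_device *dev);

static void mitigated_rx_interrupt(int irq, void *dev_id,
                                   struct pt_regs *regs)
{
        struct net_device *dev = dev_id;
        struct sk_buff *skb;
        int work = 0;

        /* Grab many packets per interrupt: this is where both the
         * reduced interrupt count and the cache amortization come
         * from, since the receive path stays hot in the cache. */
        while (work < RX_BUDGET && (skb = rx_ring_next(dev)) != NULL) {
                skb->protocol = eth_type_trans(skb, dev);
                netif_rx(skb);  /* queue onto the softnet backlog */
                work++;
        }
}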
> Some avenues for investigation:
>
> * Disable the NIC's interrupts at the hardware level when we're
>   doing receive processing.
>
>   This would be a big performance win on uniprocessor - there's no
>   *point* in taking the Rx interrupt when we're doing protocol
>   processing - we're just going to queue the packet and go back to
>   protocol processing.

If I understand you correctly, you are saying that when you are
processing packets from the backlog, you shut down every NIC's Rx
interrupt.

> I think it's also a performance win on SMP. If we're using
> NIC->CPU bonding then it's basically a UP problem anyway.

I think the better idea is to totally avoid having to do NIC->CPU
bonding and still achieve very good results.

> So it's better to disable the Rx interrupts at the end of the Rx
> ISR if we have sent something to netif_rx(). At the end of
> net_rx_action() processing we call back into the driver to see if
> it has more Rx frames available. If there are, well, we just
> process them as well, still with hardware interrupts disabled.
> This is super-quick. If there aren't any Rx packets available,
> turn on Rx interrupts.

OK, this is definitely a fourth way of doing things. I am not sure
how you are going to solve the fairness issue totally, but it is
definitely a different way of doing things.

> Note that this magically fixes the SMP packets-out-of-order problem
> as well, independent of any NIC<->CPU bonding.

I don't see how you are going to achieve this. But lately, packet
reordering has become a lesser issue (at least for TCP).

> We lose the capability to deliver an incoming packet to a different
> socket on a different CPU while we're doing protocol processing,
> but is that valuable? A net loss?

What do you mean by a socket on a different CPU?

> * Disable Tx interrupt altogether. Gone. Dead.

From my experiments, this is a very bad idea. Robert has also ended
up with the same conclusion on a different test. The problem is the
system timer granularity. I don't think you can be more accurate
than a transmit interrupt event ;->

> Instead, do the tx descriptor reaping within the driver's
> start_xmit method. Also within the (now very occasional) Rx
> interrupt.

I thought this was common. Maybe only on the tulip (or maybe our
patched version).

> This would have to be backed up with a timer of some sort. I
> expect that a one millisecond timer would be sufficiently short to
> avoid screwing up TCP. You'd keep pushing it back in time each
> time you reaped some Tx descriptors, so under heavy load it would
> never fire.
>
> If the timer _does_ fire then you can assume that there isn't much
> network load and it may be best to reenable Tx interrupts just so
> you can turn the 1 kHz timer interrupt off.

Sounds very complicated, really. But I think the key is
experimentation.

> * Poll for Tx descriptor reaping in the Rx interrupt. Poll for Rx
> packets in the start_xmit method. Save interrupts. With the above
> two tricks, we get *zero* interrupts per packet under heavy load.
>
> "Ah-ha!", you say, "what about latency?". Well, yes, this scheme
> introduces up to one millisecond of latency in the very specific
> case where traffic is falling from a high level to a low one, which
> may make it inappropriate for some classes of LAN application, but
> I suspect that the effects will be low. Plus there are a number of
> things here which *decrease* latency, such as reducing the
> interrupt count under load.

Again, I think the key is experimentation. You have generally come
up with a fourth way of doing things. The more people who try
different schemes, the better. I would say implement it, then come
up with numbers. Let's do it the _old_ IETF way: "we believe in
running code".
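In that spirit, here is a minimal sketch of how I read the scheme on
the driver side. The mask/unmask and ring helpers are hypothetical
hardware-specific internals, and sample_poll() assumes a callback
from the tail of net_rx_action() that would have to be added to the
core; treat it as running pseudocode for the control flow, not a
working driver.

#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/interrupt.h>
#include <linux/skbuff.h>

/* Hypothetical hardware-specific helpers. */
extern void mask_rx_irq(struct net_device *dev);
extern void unmask_rx_irq(struct net_device *dev);
extern struct sk_buff *rx_ring_next(struct net_device *dev);
extern void reap_tx_descriptors(struct net_device *dev);

static void sample_interrupt(int irq, void *dev_id,
                             struct pt_regs *regs)
{
        struct net_device *dev = dev_id;
        struct sk_buff *skb;

        reap_tx_descriptors(dev);  /* Tx reaping piggybacks here */

        skb = rx_ring_next(dev);
        if (skb == NULL)
                return;

        /* Hand the first frame to softnet, then go quiet: no more
         * Rx interrupts until the backlog has been worked through. */
        skb->protocol = eth_type_trans(skb, dev);
        netif_rx(skb);
        mask_rx_irq(dev);
}

/* Assumed to be called back from the end of net_rx_action(): keep
 * eating frames with Rx interrupts still masked, and only unmask
 * once the ring is empty. */
static void sample_poll(struct net_device *dev)
{
        struct sk_buff *skb;

        while ((skb = rx_ring_next(dev)) != NULL) {
                skb->protocol = eth_type_trans(skb, dev);
                netif_rx(skb);
        }
        unmask_rx_irq(dev);
}

Note that the fairness question I raised shows up in sample_poll():
one busy NIC can keep the softirq eating its ring while other
devices wait.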
> * Dynamic interrupt bonding.

I like this idea very much. Even more, I like the idea of also
maintaining the current softnet scheme of things, where you have
multiple concurrent softirqs, one on each processor.

> Some very brief testing on a 2-way indicates that TCP is a little
> more efficient when you hardwire the NIC to the CPU.

Which kernel was this? And what throughputs were you experimenting
with?

> I was thinking of a simple heuristic where you simply keep track of
> which CPU sends the most packets in a one-second time period. At
> the end of that period, subject to some hysteresis and
> thresholding, bond the NIC's interrupt to that CPU. Repeat each
> second.
>
> This assumes that a preponderance of Tx packet count correlates
> with one of Rx packet count, which seems fairly sane to me.
>
> Note that this scheme (and many other bonding schemes) will come
> horridly unstuck if multiple NICs are sharing the same interrupt!
> Don't do that.
>
> One thing which concerns me about _any_ scheme which involves
> dynamic APIC reprogramming is that weird things are likely to
> happen if we reprogram APICs when we're under load. PCs are crap,
> and we're already subject to a worrisome number of strange APIC
> problems. Trying to give the APIC a brain transplant while it is
> handling 5,000 interrupts per second seems like a recipe for
> problems.

I think some load-balancing heuristic is needed. I think the
heuristic should _not_ be based on counting packets only, but rather
on CPU load. For example, if your IDE is thrashing a lot of
interrupts, then you want to take that into account as well.
Likewise, if a CPU is running a lot of user processes, you need to
take that into account too. But definitely, some form of dynamic IRQ
routing is a good idea. It sounds like a very exciting project (and
a conference paper). From my conversation with Gerrit, they seem to
have solved this. My knowledge of the APIC is very sparse and I
don't have time at the moment. Dynamic IO-APIC reprogramming, if it
can be done very efficiently, is a definite win. But like I said, I
don't have the knowledge there, and I believe in numbers. It seems
Gerrit and co. had some scheme for figuring out which CPU is least
loaded and handing the interrupt to it. Also, I am not sure how the
currently highly parallel, scalable softnet scheme is going to be
maintained.
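For the rebinding mechanics, nothing says the policy has to live in
the kernel. Here is a minimal userspace sketch, assuming a 2.4
kernel that exposes /proc/irq/N/smp_affinity. NIC_IRQ, the
two-CPU parsing, and the "fewest interrupts over the last second"
metric are illustrative stand-ins for a real policy with proper
hysteresis and CPU-load input.

#include <stdio.h>
#include <unistd.h>

#define NIC_IRQ 18      /* assumption: eth0's IRQ */
#define NCPUS   2       /* assumption: two-CPU box */

/* Sum interrupt counts per CPU from /proc/interrupts as a crude
 * proxy for per-CPU interrupt load (two-CPU format assumed). */
static void sample_load(unsigned long load[NCPUS])
{
        char line[256];
        FILE *f = fopen("/proc/interrupts", "r");
        int irq, i;
        unsigned long c0, c1;

        for (i = 0; i < NCPUS; i++)
                load[i] = 0;
        if (!f)
                return;
        while (fgets(line, sizeof(line), f)) {
                /* lines look like: " 18:  12345  67890  ... eth0" */
                if (sscanf(line, " %d: %lu %lu", &irq, &c0, &c1) == 3) {
                        load[0] += c0;
                        load[1] += c1;
                }
        }
        fclose(f);
}

/* Rebind an IRQ by writing a one-hot hex CPU mask. */
static void bind_irq(int irq, int cpu)
{
        char path[64];
        FILE *f;

        sprintf(path, "/proc/irq/%d/smp_affinity", irq);
        f = fopen(path, "w");
        if (!f)
                return;
        fprintf(f, "%x\n", 1u << cpu);
        fclose(f);
}

int main(void)
{
        unsigned long prev[NCPUS], now[NCPUS];
        int bound = -1;

        sample_load(prev);
        for (;;) {
                int i, target = 0;

                sleep(1);
                sample_load(now);
                /* pick the least-loaded CPU over the last second */
                for (i = 1; i < NCPUS; i++)
                        if (now[i] - prev[i] < now[target] - prev[target])
                                target = i;
                if (target != bound) {  /* crude hysteresis point */
                        bind_irq(NIC_IRQ, target);
                        bound = target;
                }
                for (i = 0; i < NCPUS; i++)
                        prev[i] = now[i];
        }
        return 0;
}

Doing it in userspace also sidesteps Andrew's worry about
reprogramming the APIC from interrupt context: the write happens at
a leisurely once per second.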
> Last time I looked, Alphas didn't have APICs. We need to design a
> sensible architecture-neutral interrupt bonding API (or at least a
> queryable one) before we run off making x86-specific changes.

How is IRQ affinity achieved on Alphas, then? ;-> I thought this was
a common thing that comes in the PCI package. And if they can't do
IRQ affinity, I would say they deserve to miss this boat as well.

> As a footnote, and I know this won't be a popular view on lse-tech -

BTW, what is this LSE list? I just joined it.

> philosophically speaking I believe that 2.4 has given enough to the
> big-end guys. I hope that in 2.5, more emphasis and kernel
> developer talent will be devoted to the other 99.99% of Linux
> users. Better device support, plug-n-play, manageability,
> upgradability, etc. Linux seems to be becoming more and more a
> server OS lately and I'd like to see that turned around.
>
> Of course the three-letter corps need the scalability. Good luck to
> them and thanks for supporting Linux.

Andrew, I am basically in agreement with you on this. But,
realistically: do you think these three-letter corps care about
anything that doesn't serve them? I don't see SGI trying to help
make Linux more user friendly, or even the justification for doing
so from their corporate perspective. This is also going to affect
all the other "traditional" Linux companies; in fact, it might be
doing so already. It is OK to let them serve their corporate
interests as long as they help Linux. It is healthy _as long as_
there are no competing goals at the kernel level, with one of them
getting into the kernel. Competing goals will result in Linux forks.
This might not be the case for user space (Gnome vs. KDE), unless
the case involves exposing some APIs from the kernel.

> For the privateers, yes, it's
> *fun* to make Linux faster and it is gratifying, but we need to be
> aware that it is also *easy*. Solving the problems which are faced
> by the wider community of Linux users is going to be dull, and
> hard.

I am not sure *easy* is the correct description here. Fun, yes. That
is why I participate (and maybe you as well). And if it is fun, by
definition it means I do what I like. I think a combination of
people like us results in an overall improved Linux, as long as
there are not too many overlaps. And even overlaps might not be a
bad idea if you have plenty of time. I find Gnome development
boring, dull, and hard (not that I couldn't do it if you pointed a
gun at me). I am sure the Gnome people think the same about what I
do.

cheers,
jamal