From: Nick P. <npo...@sg...> - 2000-12-21 19:20:00
|
Here are various code snippets for doing process pinning. We use 'runon',
which has the following syntax:

    runon processor command [args]
or
    runon processor -p pid

I will be on holiday for the next week, and will be available to field
questions/requests after the new year. Patches are built against 2.4.0-test12.

Patch #1: The changes to array.c cause /proc/pid/stat to print the
cpus_allowed field from task_struct. This isn't necessary for process
pinning, but it is a nice diagnostic feature.

Patch #2: The changes to prctl.h are necessary for compiling runon.c.

Patch #3: The changes to sys.c perform the guts of the work. The cases for
PR_GET_RUNON and PR_SET_RUNON were developed by Dimitris and/or Ingo. I added
the case for PR_MUSTRUN_PID. PR_SET_RUNON and PR_MUSTRUN_PID are used by
runon.

Patch #4: Not shown below is a scheduler patch from IBM that seemed to make
this more reliable. The patch is available from:
http://www.geocrawler.com/archives/3/5312/2000/9/0/4295221/

runon.c: This is the code for runon.
**Begin Patch
diff -Nur -X dontdiff basetree/fs/proc/array.c mylinux/fs/proc/array.c
--- basetree/fs/proc/array.c	Sat Nov 25 00:09:18 2000
+++ mylinux/fs/proc/array.c	Tue Dec 12 14:31:08 2000
@@ -347,7 +347,7 @@
 	read_unlock(&tasklist_lock);
 	res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
 %lu %lu %lu %lu %lu %ld %ld %ld %ld %ld %ld %lu %lu %ld %lu %lu %lu %lu %lu \
-%lu %lu %lu %lu %lu %lu %lu %lu %d %d\n",
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu\n",
 		task->pid,
 		task->comm,
 		state,
@@ -390,7 +390,8 @@
 		task->nswap,
 		task->cnswap,
 		task->exit_signal,
-		task->processor);
+		task->processor,
+		task->cpus_allowed);
 	if (mm)
 		mmput(mm);
 	return res;
diff -Nur -X dontdiff basetree/include/linux/prctl.h mylinux/include/linux/prctl.h
--- basetree/include/linux/prctl.h	Mon Mar 20 10:07:55 2000
+++ mylinux/include/linux/prctl.h	Mon Dec 11 18:15:26 2000
@@ -20,4 +20,8 @@
 #define PR_GET_KEEPCAPS 7
 #define PR_SET_KEEPCAPS 8
 
+#define PR_GET_RUNON 9
+#define PR_SET_RUNON 10
+#define PR_MUSTRUN_PID 11
+
 #endif /* _LINUX_PRCTL_H */
diff -Nur -X dontdiff basetree/kernel/sys.c mylinux/kernel/sys.c
--- basetree/kernel/sys.c	Wed Nov  1 13:38:31 2000
+++ mylinux/kernel/sys.c	Tue Dec 19 18:38:54 2000
@@ -1203,11 +1203,62 @@
 		}
 		current->keep_capabilities = arg2;
 		break;
+	case PR_GET_RUNON:
+		error = put_user(current->cpus_allowed, (long *)arg2);
+		break;
+	case PR_SET_RUNON:
+		if (arg2 == 0)
+			arg2 = 1 << smp_processor_id();
+		arg2 &= cpu_online_map;
+		if (!arg2)
+			error = -EINVAL;
+		else {
+			current->cpus_allowed = arg2;
+			if (!(arg2 & (1 << smp_processor_id())))
+				current->need_resched = 1;
+		}
+		break;
+	case PR_MUSTRUN_PID:
+		/* arg2 is a cpu mask, arg3 is the pid */
+		if (arg2 == 0)
+			arg2 = 1 << smp_processor_id();
+		arg2 &= cpu_online_map;
+		if (!arg2)
+			error = -EINVAL;
+		else
+			error = mp_mustrun_pid(arg2, arg3);
+		break;
 	default:
 		error = -EINVAL;
 		break;
 	}
 	return error;
+}
+
+static int mp_mustrun_pid(int cpu, int pid)
+{
+	struct task_struct *p;
+	int ret;
+
+	ret = -EPERM;
+	/* Not allowed to change pid 1 */
+	if (pid == 1)
+		goto out;
+
+	ret = -ESRCH;
+	read_lock(&tasklist_lock);
+	p = find_task_by_pid(pid);
+	if (p)
+		get_task_struct(p);
+	read_unlock(&tasklist_lock);
+	if (!p)
+		goto out;
+
+	p->cpus_allowed = cpu;
+	p->need_resched = 1;
+	ret = 0;
+	free_task_struct(p);
+
+out:
+	return ret;
 }
 
 EXPORT_SYMBOL(notifier_chain_register);
**End patch

**Begin runon.c
/*
 * runon.c - assign a process to a named processor
 *
 * Copyright (c) 2000 Silicon Graphics, Inc.  All Rights Reserved.
 *
 * This program is free software; you can redistribute it and/or modify it
 * under the terms of version 2 of the GNU General Public License as
 * published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it would be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 *
 * Further, this software is distributed without any warranty that it is
 * free of the rightful claim of any third person regarding infringement
 * or the like.  Any license provided herein, whether implied or
 * otherwise, applies only to this software file.  Patent licenses, if
 * any, provided herein do not apply to combinations of this program with
 * other software, or any other product whatsoever.
 *
 * You should have received a copy of the GNU General Public License along
 * with this program; if not, write the Free Software Foundation, Inc., 59
 * Temple Place - Suite 330, Boston MA 02111-1307, USA.
 *
 * Contact information: Silicon Graphics, Inc., 1600 Amphitheatre Pkwy,
 * Mountain View, CA 94043, or:
 *
 * http://www.sgi.com
 *
 * For further information regarding this notice, see:
 *
 * http://oss.sgi.com/projects/GenInfo/SGIGPLNoticeExplan/
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <errno.h>
#include <unistd.h>
#include <linux/prctl.h>
#include <sys/prctl.h>		/* prctl() prototype */

static void usage(void);

int main(int argc, char *argv[])
{
	int c, processor, pid = 0;
	int usepid = 0;
	char *p;

	if (argc < 3) {
		usage();
		exit(1);
	}

	while ((c = getopt(argc, argv, "p:")) != -1) {
		switch (c) {
		case 'p':
			usepid = 1;
			pid = atoi(optarg);
			break;
		}
	}

	for (p = argv[optind]; *p; p++) {
		if (!isdigit(*p)) {
			fprintf(stderr, "%s: cpu argument must be numeric.\n",
			    argv[0]);
			exit(2);
		}
	}
	processor = atoi(argv[optind]);
	optind++;

	if (usepid) {
		/* the kernel interface takes a cpu mask, not a cpu number */
		if (prctl(PR_MUSTRUN_PID, 1 << processor, pid, 0, 0) < 0) {
			fprintf(stderr,
			    "%s: could not attach pid %d to processor %d -- %s\n",
			    argv[0], pid, processor, strerror(errno));
			exit(1);
		}
	} else {
		if (prctl(PR_SET_RUNON, 1 << processor, 0, 0, 0) < 0) {
			fprintf(stderr,
			    "%s: could not attach to processor %d -- %s\n",
			    argv[0], processor, strerror(errno));
			exit(1);
		}
		execvp(argv[optind], &argv[optind]);
		fprintf(stderr, "%s: %s\n", strerror(errno), argv[optind]);
		exit(1);
	}
	return 0;
}

static void usage(void)
{
	fprintf(stderr,
	    "usage: runon processor (-p pid | command [args...])\n");
}
**End runon.c

-- 
------------------------------------------------------------------
Nick Pollitt                     phone: 650.933.7406
MTS - Design                     fax:   650.933.3542
SGI                              npo...@sg...
|
From: Tim H. <th...@co...> - 2000-12-21 19:31:11
|
Nick Pollitt wrote:
>
> Here are some various code snippets for doing process pinning. We use 'runon'
> which has the following syntax:
> runon processor command [args]
> or
> runon processor -p pid

Please note that I did a port of the sysmp() system call and pset tools to
Linux quite some time ago - http://isunix.it.ilstu.edu/~thockin/pset.

Linus and Alan generally rejected the patch as "bloat".

-- 
Tim Hockin
Software/OS Engineer
Sun Microsystems, Cobalt Server Appliance Business Unit
th...@co...
|
From: Andi K. <ak...@su...> - 2000-12-21 21:10:29
|
On Thu, Dec 21, 2000 at 11:36:52AM -0800, Tim Hockin wrote:
> Nick Pollitt wrote:
> >
> > Here are some various code snippets for doing process pinning. We use 'runon'
> > which has the following syntax:
> > runon processor command [args]
> > or
> > runon processor -p pid
>
> Please note that I did a port of the sysmp() system call and pset tools
> to linux quite some time ago - http://isunix.it.ilstu.edu/~thockin/pset.
>
> Linus and Alan generally rejected the patch as "bloat".

Their views will need to change now. Binding to a CPU is needed to get any
good performance with softnet and multiple NICs (at least until Linux learns
dynamic APIC tuning to redirect I/O interrupts to the right CPUs).

-Andi
|
From: Tim W. <ti...@sp...> - 2000-12-21 21:24:29
|
Hi Andi,
sounds like we should be talking. What you say is exactly in line with some
work we have planned here. We're currently looking to prototype use of APIC
priority programming here in Beaverton.

Simply put, in DYNIX/ptx we always set the local APIC priority to indicate
what the processor is doing. Doing this means that interrupts always get
delivered to the least loaded (preferably idle) CPU.

We were thinking that, at least initially, we would not add interrupt
priority levels (or that, if we did, the primitives would support them), but
that we would simply change from the current use of 'cli' to a two-level
scheme, i.e. all enabled or all disabled, but at least performed at the APIC.

Naturally, we'll need to benchmark this to see what performance changes
occur. I have heard reservations expressed regarding the overhead of the APIC
programming vs. the cost of 'cli/sti', and it remains to be seen whether the
improved interrupt latency and throughput makes up for the added cost.

Regards,
Tim

On Thu, Dec 21, 2000 at 10:10:22PM +0100, Andi Kleen wrote:
> On Thu, Dec 21, 2000 at 11:36:52AM -0800, Tim Hockin wrote:
> > Nick Pollitt wrote:
> > >
> > > Here are some various code snippets for doing process pinning. We use 'runon'
> > > which has the following syntax:
> > > runon processor command [args]
> > > or
> > > runon processor -p pid
> >
> > Please note that I did a port of the sysmp() system call and pset tools
> > to linux quite some time ago - http://isunix.it.ilstu.edu/~thockin/pset.
> >
> > Linus and Alan generally rejected the patch as "bloat".
>
> Their views will need to change now. bind to cpu is needed to get any good
> performance with softnet and multiple NICs (at least until Linux learns
> dynamic APIC tuning to redirect IO interrupts to the right CPUs)
>
> -Andi
>
> _______________________________________________
> Lse-tech mailing list
> Lse...@li...
> http://lists.sourceforge.net/mailman/listinfo/lse-tech

-- 
Tim Wright - ti...@sp... or ti...@ar... or tw...@us...
IBM Linux Technology Center, Beaverton, Oregon
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI
|
From: Andi K. <ak...@su...> - 2000-12-21 21:56:52
|
On Thu, Dec 21, 2000 at 01:24:16PM -0800, Tim Wright wrote:
> Hi Andi,
> sounds like we should be talking. What you say is exactly in line with some
> work we have planned here. We're currently looking to prototype use of
> APIC priority programming here in Beaverton.
>
> Simply put, in DYNIX/ptx we always set the local APIC priority to indicate
> what the processor is doing. Doing this means that interrupts always get
> delivered to the least loaded (preferably) idle CPU.
>
> We were thinking that, at least initially, we would not add interrupt priority
> levels, or that, if we did, the primitives would support them, but that we
> would simply change from the current use of 'cli' to a two level scheme i.e.
> all enabled or all disabled, but at least performed at the APIC.
>
> Naturally, we'll need to benchmark this to see what performance changes occur.
> I have heard reservations expressed regarding the overhead of the APIC
> programming vs. the cost of 'cli/sti', and it remains to be seen whether the
> improved interrupt latency and throughput makes up for the added cost.

Sending the interrupts to the CPU with the lowest load would be very
different from what I was thinking. What Tux does is try to make sure that
NIC interrupts are always taken on the same CPU as the thread that will
process the packet. This way cross-SMP traffic for the skbuff and socket
state is avoided. In their SpecWeb run they just did it manually, binding
NICs to CPUs using /proc/irq/IRQ#/smp_affinity and binding the worker thread
for that NIC to the same CPU.

There were also some preliminary ideas (mostly from Andrew Morton and Ingo
Molnar) to do this binding dynamically: keep some statistics on how often an
skbuff crosses a CPU, from which NIC and to which thread(s), and then change
the APIC and the thread's cpu goodness to try to redirect the IRQ to the
right CPU.

The problem with that is that it may fail spectacularly on some non-benchmark
loads: when there are e.g. two CPU-intensive threads that get fed from the
same NIC, this heuristic could pessimize things.

To avoid this, some knob for manual tuning is still needed, preferably with a
soft CPU bind to handle varying load: just give a thread a preference to run
on a CPU, but do not 100% bind it.

I think the main difference between this approach and yours is that it tries
to minimize cross-CPU traffic, and not only load on the CPUs (assuming that
the CPUs are much faster than the bus). The scheme could probably be extended
to other I/O too, but it is not clear whether it is worth as much there,
because block I/O tends to have much less state than networking.

-Andi
|
From: Tim W. <ti...@sp...> - 2000-12-21 22:39:10
|
Yes, this is different, and probably an equally valid thing to be looking at.
Web servers still seem to be a fairly important class of application, recent
downturns in the stock market notwithstanding!

Tim

On Thu, Dec 21, 2000 at 10:56:49PM +0100, Andi Kleen wrote:
> On Thu, Dec 21, 2000 at 01:24:16PM -0800, Tim Wright wrote:
> > Hi Andi,
> > sounds like we should be talking. What you say is exactly in line with some
> > work we have planned here. We're currently looking to prototype use of
> > APIC priority programming here in Beaverton.
> >
> > Simply put, in DYNIX/ptx we always set the local APIC priority to indicate
> > what the processor is doing. Doing this means that interrupts always get
> > delivered to the least loaded (preferably) idle CPU.
> >
> > We were thinking that, at least initially, we would not add interrupt priority
> > levels, or that, if we did, the primitives would support them, but that we
> > would simply change from the current use of 'cli' to a two level scheme i.e.
> > all enabled or all disabled, but at least performed at the APIC.
> >
> > Naturally, we'll need to benchmark this to see what performance changes occur.
> > I have heard reservations expressed regarding the overhead of the APIC
> > programming vs. the cost of 'cli/sti', and it remains to be seen whether the
> > improved interrupt latency and throughput makes up for the added cost.
>
> Sending the interrupts to the CPU with the lowest load would be very different
> from what I was thinking. What Tux does is to try to make
> sure that NIC interrupts are always taken on the same CPU as the thread that
> will process the packet. This way cross SMP traffic for the skbuff and
> socket state is avoided. In their SpecWeb run they just did it manually,
> binding NICs to CPUs using /proc/irq/IRQ#/smp_affinity and binding the worker
> thread for that to the same CPU.
>
> There was also some preliminary ideas (mostly from Andrew Morton, Ingo Molnar)
> to do this binding dynamically. Keep some statistics on how often a skbuff
> crosses a CPU from which NIC and to which thread(s), and then change the APIC
> and the thread's cpu goodness to try to redirect the IRQ to the
>
> Problem with that is that it may fail spectacularly on some non benchmark
> loads when there are e.g. two CPU intensive threads that get feed from the
> same NIC then this heuristic could pessimize things.
>
> To avoid this some knob for manual tuning is still needed, preferably with
> a soft cpu bind to handle varying load: just give a thread a preference to run
> on a CPU, but do not 100% bind it.
>
> I think the main difference from this approach to yours is that it tries
> to minimize cross CPU traffic and not only load on the CPUs (assuming that the CPUs
> are much faster than the bus). The scheme could probably be extended to other
> IO too, but it is not clear if it is worth there as much because block IO
> tends to have much less state than networking.
>
> -Andi

-- 
Tim Wright - ti...@sp... or ti...@ar... or tw...@us...
IBM Linux Technology Center, Beaverton, Oregon
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI
|
From: Andi K. <ak...@su...> - 2000-12-21 22:57:46
|
On Thu, Dec 21, 2000 at 02:38:57PM -0800, Tim Wright wrote:
> Yes, this is different, and probably an equally valid thing to be looking at.
> Web servers still seem to be a fairly important class of application, recent
> downturns in the stock market notwithstanding !

You think it wouldn't help for database servers? (e.g. with a NIC and a SCSI
controller per CPU)

-Andi
|
From: Tim W. <ti...@sp...> - 2000-12-22 00:48:56
|
Hmmm.... depends on the database and the workload etc. This might be feasible
on small database servers, but I'm used to systems with thousands of
connections performing transactions. OLTP workloads tend to involve a lot of
shared information that gets pinged around the CPUs anyway, so it's not clear
to me how giving I/O device drivers affinity would necessarily win you very
much. For DSS workloads, full-table scans etc., it's conceivable that such an
approach would yield measurable performance improvements.

Locality of main memory is a more pressing concern on NUMA systems. Here at
IBM Beaverton (what was Sequent), we use Fibre-Channel controllers and
switches to achieve a multi-pathed I/O fabric where you can ensure that disk
I/O is performed from the local quad (4 x processor + memory + PCI building
block). This also improves fault-tolerance and yields some impressive I/O
figures. We were getting 3 GB/s data read doing full-table scans in DB2 about
a year ago.

I suspect that, fundamentally, the degree to which binding devices to
processors is a win is determined by the degree of information sharing in the
application. Web servers work beautifully with this because it's sort of like
a "cluster in a box": a 4-way server sort of looks like four 1-way servers,
but with common memory and disks - there's not really much information passed
between tasks. There are other workloads where this is not true. Anyway,
hopefully we'll get to test some of this, measure it and see what happens.

Tim

On Thu, Dec 21, 2000 at 11:57:43PM +0100, Andi Kleen wrote:
> On Thu, Dec 21, 2000 at 02:38:57PM -0800, Tim Wright wrote:
> > Yes, this is different, and probably an equally valid thing to be looking at.
> > Web servers still seem to be a fairly important class of application, recent
> > downturns in the stock market notwithstanding !
>
> You think it wouldn't help for database servers ? (e.g. with a NIC and a
> SCSI controller per CPU)
>
> -Andi

-- 
Tim Wright - ti...@sp... or ti...@ar... or tw...@us...
IBM Linux Technology Center, Beaverton, Oregon
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI
|
From: Gerrit H. <ge...@us...> - 2000-12-23 00:52:15
|
Andi Kleen wrote:
> You think it wouldn't help for database servers ? (e.g. with a NIC and a
> SCSI controller per CPU)
>
> -Andi

I think NIC-to-CPU binding would simply increase the latency problem for
interrupt delivery in this case. Allowing the APIC to direct an interrupt to
the first available CPU decreases the average interrupt delivery latency.
And I'd guess that the interrupt latency more likely governs the throughput
than the sharing of a few cache lines (1/nprocessors of the time) on most
modern SMP systems.

Of course, this depends a lot on how long a lock is held (lock_irq()). I had
heard that the number of instructions for which a lock is generally held in
Linux was *very* small, although some code I've looked at doesn't seem to
bear that out (at least not any more).

gerrit
|
From: Andi K. <ak...@su...> - 2000-12-23 04:30:14
|
On Fri, Dec 22, 2000 at 04:52:06PM -0800, Gerrit Huizenga wrote:
> Andi Kleen wrote:
> > You think it wouldn't help for database servers ? (e.g. with a NIC and a
> > SCSI controller per CPU)
> >
> > -Andi
>
> I think NIC to CPU binding would simply increase the latency problem
> for interrupt delivery in this case. Allowing the APIC to direct an
> interrupt to the first available CPU decreases the average interrupt
> delivery latency. And, I'd guess that the interrupt latency more
> likely governs the throughput than the sharing of a few cache lines
> (1/nprocessors of the time) on most modern SMP systems.

What current Linux does, and what is still the default, is to simply try to
distribute interrupts evenly over all CPUs, ignoring user-space load and the
irq lock. Linux 2.0 took the lock into account, but it was still using a big
kernel lock for everything including interrupts, so it may have made more
sense there. It also used a software scheme (spin on the lock for some time;
after that, redirect the interrupt to some other CPU).

> Of course, this depends a lot on how long a lock is held (lock_irq()).
> I had heard that the number of instructions generally that a lock
> was held in Linux was *very* small, although some code I've looked
> at doesn't seem to bear that out (at least not any more).

It depends on the configuration. A common pig is the (irq-safe)
io_request_lock when the low-level driver does not cooperate. The Linux SCSI
layer holds it over requests for compatibility with the 2.0 driver locking
rules, and relies on the driver to drop it when it is safe to do so. Not all
drivers do that, and it is held for a long time anyway. I hope that will be
cleaned up in 2.5 with the kiovec block device overhaul.

Another pig, though one that should hopefully not occur on database servers,
is the IDE driver when hdparm -u1 is not used.

-Andi
|
From: Andrew M. <an...@uo...> - 2000-12-24 04:18:51
|
Gerrit Huizenga wrote:
>
> Andi Kleen wrote:
>
> > You think it wouldn't help for database servers ? (e.g. with a NIC and a
> > SCSI controller per CPU)
> >
> > -Andi
>
> I think NIC to CPU binding would simply increase the latency problem
> for interrupt delivery in this case. Allowing the APIC to direct an
> interrupt to the first available CPU decreases the average interrupt
> delivery latency. And, I'd guess that the interrupt latency more
> likely governs the throughput than the sharing of a few cache lines
> (1/nprocessors of the time) on most modern SMP systems.
>
> Of course, this depends a lot on how long a lock is held (lock_irq()).
> I had heard that the number of instructions generally that a lock
> was held in Linux was *very* small, although some code I've looked
> at doesn't seem to bear that out (at least not any more).

I had a few wild thoughts on this topic earlier in the year. I haven't had a
chance to do anything with them because people keep on putting bugs in the
kernel :)

* presume that interrupts are wickedly expensive and we want to minimise
  them. This is more relevant to low-end (100 Mbit) NICs.

* presume that cross-CPU traffic and cache misses are expensive, and we want
  to optimise for these.

Some avenues for investigation:

* Disable the NIC's interrupts at the hardware level when we're doing
  receive processing.

  This would be a big performance win on uniprocessor - there's no *point*
  in taking the Rx interrupt while we're doing protocol processing - we're
  just going to queue the packet and go back to protocol processing.

  I think it's also a performance win on SMP. If we're using NIC->CPU
  bonding then it's basically a UP problem anyway.

  So it's better to disable the Rx interrupts at the end of the Rx ISR if we
  have sent something to netif_rx(). At the end of net_rx_action() processing
  we call back into the driver to see if it has more Rx frames available. If
  there are, well, we just process them as well, still with hardware
  interrupts disabled. This is super-quick. If there aren't any Rx packets
  available, turn Rx interrupts back on.

  Note that this magically fixes the SMP packets-out-of-order problem as
  well, independent of any NIC<->CPU bonding.

  We lose the capability to deliver an incoming packet to a different socket
  on a different CPU while we're doing protocol processing, but is that
  valuable? A net loss?

* Disable the Tx interrupt altogether. Gone. Dead.

  Instead, do the Tx descriptor reaping within the driver's start_xmit
  method. Also within the (now very occasional) Rx interrupt.

  This would have to be backed up with a timer of some sort. I expect that a
  one-millisecond timer would be sufficiently short to avoid screwing up
  TCP. You'd keep pushing it back in time each time you reaped some Tx
  descriptors, so under heavy load it would never fire.

  If the timer _does_ fire then you can assume that there isn't much network
  load, and it may be best to re-enable Tx interrupts just so you can turn
  the 1 kHz timer interrupt off.

* Poll for Tx descriptor reaping in the Rx interrupt. Poll for Rx packets in
  the start_xmit method. Save interrupts. With the above two tricks, we get
  *zero* interrupts per packet under heavy load.

  "Ah-ha!", you say, "what about latency?". Well, yes, this scheme
  introduces up to one millisecond of latency in the very specific case
  where traffic is falling from a high level to a low one, which may make it
  inappropriate for some classes of LAN application, but I suspect that the
  effects will be low. Plus there are a number of things here which
  *decrease* latency, such as reducing the interrupt count under load.

* Dynamic interrupt bonding.

  Some very brief testing on a 2-way indicates that TCP is a little more
  efficient when you hardwire the NIC to the CPU.

  I was thinking of a simple heuristic where you simply keep track of which
  CPU sends the most packets in a one-second time period. At the end of that
  period, subject to some hysteresis and thresholding, bond the NIC's
  interrupt to that CPU. Repeat each second.

  This assumes that a preponderance of Tx packet count correlates with one
  of Rx packet count, which seems fairly sane to me.

  Note that this scheme (and many other bonding schemes) will come horridly
  unstuck if multiple NICs are sharing the same interrupt! Don't do that.

One thing which concerns me about _any_ scheme which involves dynamic APIC
reprogramming is that weird things are likely to happen if we reprogram APICs
while we're under load. PCs are crap, and we're already subject to a
worrisome number of strange APIC problems. Trying to give the APIC a brain
transplant while it is handling 5,000 interrupts per second seems like a
recipe for problems.

Last time I looked, Alphas didn't have APICs. We need to design a sensible
architecture-neutral interrupt bonding API (or at least a queryable one)
before we run off making x86-specific changes.

As a footnote, and I know this won't be a popular view on lse-tech -
philosophically speaking, I believe that 2.4 has given enough to the big-end
guys. I hope that in 2.5, more emphasis and kernel developer talent will be
devoted to the other 99.99% of Linux users. Better device support,
plug-n-play, manageability, upgradability, etc. Linux seems to be becoming
more and more a server OS lately and I'd like to see that turned around.

Of course the three-letter corps need the scalability. Good luck to them and
thanks for supporting Linux. For the privateers, yes, it's *fun* to make
Linux faster and it is gratifying, but we need to be aware that it is also
*easy*. Solving the problems which are faced by the wider community of Linux
users is going to be dull, and hard.

-
|
From: jamal <ha...@cy...> - 2000-12-25 00:16:42
|
I apologize for the long email. Blame Andrew Morton ;-> Since the OLS, Robert Olson and myself have been looking different schemes to improve what i presented. Summary: Reported < 80Kpps at OLS for SMP. For those who missed the presentation look at: http://robur.slu.se/Linux/net-development/jamal/FF-html/ A few tweaks on the same code and the numbers have gone upto 110Kpps on uni-processor and 130Kpps on SMP. The main peering routers at slu.se PCs running these patches and peering via gated/BGP. Robert has seen in the upwards of 200Kpps on GigE without any tweaks. We'll be presenting more precise results at NORD-USENIX in february. We have basically exceeded the promise at OLS to do 100Mbps wire speed (~148Kpps) by the end of the year. In the meantime, we have each come up with slightly different schemes that we hope will take us to that next level. We've also discovered work done by Robert Morris at MIT on clip which is yet a third approach, fascinating (except for the C++ part ;-<). Robert Morris says he is able to do 333Kpps on a dual PCI SMP machine. ( it would be nice to get access to a beast like that). In my set of experiments (done a few months ago), i have been able to get 140Kpps on a single processor and upto 190Kpps on SMP with IRQ affinity. I belive these numbers can go higher with proper driver tuning but will peak at some point mostly because of the APIC. Gerrit Huizenga and I had a conversation at the OLS in which he pointed out that Linux blindly does RR on receiving interupts and hands them to the next CPU on the list. I have been informed that this is infact a property of the APIC ;-> I believe i am being bitten by this at the moment. I am hoping to test his patches at some point (unfortunately this christmas break i have a more important/exciting project i am working on). On Sun, 24 Dec 2000, Andrew Morton wrote: > Gerrit Huizenga wrote: > > > > Andi Kleen wrote: > > > > > You think it wouldn't help for database servers ? (e.g. 
with a NIC and a > > > SCSI controller per CPU) > > > > > > -Andi > > > > I think NIC to CPU binding would simply increase the latency problem > > for interrupt delivery in this case. Allowing the APIC to direct an > > interrupt to the first available CPU decreases the average interrupt > > delivery latency. And, I'd guess that the interrupt latency more > > likely governs the throughput than the sharing of a few cache lines ( 1 > > / nprocessors) of the time on most modern SMP systems. > > > > Of course, this depends a lot on how long a lock is held (lock_irq()). > > I had heard that the number of instructions generally that a lock > > was held in Linux was *very* small, although some code I've looked > > at doesn't seem to bear that out (at least not any more). Sounds logical. I think being able to dynamically route IRQ under some policy mechanism (such as the one available for IRQ affinity right now) would help a great deal. > > I had a few wild thoughts on this topic earlier in the year. I haven't > had a chance to do anything with them because people keep on putting > bugs in the kernel :) > > * presume that interrupts are wickedly expensive and we want to > minimise them. This is more relevant to low-end (100mbit) NICs. > > * presume that cross-CPU traffic and cache misses are expensive, and > we want to optimise for these. > OLS solution was interupt mitigation: Cache amortization is achieved because you grab many packets per unit time (as well reduced interupts are a given). This does not reduce the x-CPU traffic, of course, but there was none introduced to start with. > Some avenues for investigation: > > * Disable the NIC's interrupts at the hardware level when we're doing > receive processing. > > This would be a big performance win on uniprocessor - there's no > *point* in taking the Rx interrupt when we're doing protocol > processing - we're just going to queue the packet and go back to > protocol processing. 
> If i understand you correctly, you are saying that when you are procesing packets from the backlog, you shutdown every NICs rx interupt. > I think it's also a performance win on SMP. If we're using > NIC->CPU bonding then it's basically a UP problem anyway. > I think the better idea is to totaly avoid having to do NIC->CPU bonding and still achieve very good results. > So it's better to disable the Rx interrupts at the end of the Rx > ISR if we have sent something to netif_rx(). At the end of > net_rx_action() processing we call back into the driver to see if it > has more Rx frames available. If there are, well, we just process > them as well, still with hardware interrupts disabled. This is > super-quick. If there aren't any Rx packets available, turn on > Rx interrupts. OK. this is definetely a 4th way of doing things. I am not sure how you are going to solve the fairness issue totaly but it is definetely a different way of doing things. > > Note that this magically fixes the SMP packets-out-of-order problem > as well, independent of any NIC<->CPU bonding. > I dont see how you are going to achieve this. But lately packet reodering has become a lesser issue (at least for TCP). > We lose the capability to deliver an incoming packet to a different > socket on a different CPU while we're doing protocol processing, but > is that valuable? A net loss? What do you mean by socket on different CPU? > > * Disable Tx interrupt altogether. Gone. Dead. > From my experiments, this is a very bad idea. Robert has also ended with the same conclusion on a different test. The problem is the system timer granularity. I dont think you can be more accurate than a transmit interupt event ;-> > Instead, do the tx descriptor reaping within the driver's > start_xmit method. Also within the (now very occasional) Rx > interrupt. > I thought this was common. Maybe only on the tulip (or maybe our patched version). > This would have to be backed up with a timer of some sort. 
I > expect that a one millisecond timer would be sufficiently short to > avoid screwing up TCP. You'd keep pushing it back in time each time > you reaped some Tx descriptors, so under heavy load it would never > fire. > > If the timer _does_ fire then you can assume that there isn't much > network load and it may be best to reenable Tx interrupts just so you > can turn the 1 kHz timer interrupt off. Sounds very complicated really. But I think the key is experimentation. > > * Poll for Tx descriptor reaping in the Rx interrupt. Poll for Rx > packets in the start_xmit method. Save interrupts. With the above > two tricks, we get *zero* interrupts per packet under heavy load. > > "Ah-ha!", you say, "what about latency?". Well, yes, this scheme > introduces up to one millisecond latency in the very specific case > where traffic is falling from a high level to a low one, which may > make it inappropriate for some classes of LAN application, but I suspect > that the effects will be low. Plus there are a number of things here > which *decrease* latency, such as reducing the interrupt count under > load. > Again, I think the key is experimentation. You have generally come up with a 4th way of doing things. The more people try different schemes, the better. I would say implement, then come up with numbers. Let's do it the _old_ IETF way: "we believe in running code" > * Dynamic interrupt bonding. > I like this idea very much. Even more, I like the idea of also maintaining the current softnet scheme of things where you have multiple concurrent softirqs, one on each processor. > Some very brief testing on a 2-way indicates that TCP is a little > more efficient when you hardwire the NIC to the CPU. > Which kernel was this? And what throughputs were you experimenting with? > I was thinking of a simple heuristic where you simply keep track of > which CPU sends the most packets in a one-second time period. 
At the > end of that period, subject to some hysteresis and thresholding, bond > the NIC's interrupt to that CPU. Repeat each second. > > This assumes that a preponderance of Tx packet count correlates > with one of Rx packet count, which seems fairly sane to me. > > Note that this scheme (and many other bonding schemes) will come > horridly unstuck if multiple NICs are sharing the same interrupt! > Don't do that. > > One thing which concerns me about _any_ scheme which involves dynamic > APIC reprogramming is that weird things are likely to happen if we > reprogram APICs when we're under load. PCs are crap, and we're already > subject to a worrisome number of strange APIC problems. Trying to give > the APIC a brain transplant while it is handling 5,000 interrupts per > second seems like a recipe for problems. > I think some load balancing heuristic is needed. I think the heuristic should _not_ be based on counting packets only but rather on CPU load. For example, if your IDE is generating a lot of interrupts then you want to take this into account as well. As well, if a CPU is running a lot of user processes you need to take into account those issues. But definitely some form of dynamic IRQ routing is a good idea. Sounds like a very exciting project (and a conference paper). From my conversation with Gerrit they seem to have solved this. My knowledge of the APIC is very sparse and I don't have time at the moment. Dynamic IO-APIC reprogramming, if it can be done very efficiently, is a definite win. But like I said I don't have the knowledge there and I believe in numbers. Seems Gerrit and co had some scheme of figuring which CPU is least loaded and handing the interrupt to it. Also, I am not sure how the currently highly parallel, scalable softnet scheme is going to be maintained. > Last time I looked, Alphas didn't have APICs. 
We need to design a > sensible architecture-neutral interrupt bonding API (or at least a > queryable one) before we run off making x86-specific changes. How is IRQ affinity achieved on Alphas then? ;-> I thought this was a common thing that comes in the PCI package. And if they can't do IRQ affinity, I would say they deserve to miss this boat as well. > > As a footnote, and I know this won't be a popular view on lse-tech - BTW, what is this LSE list? I just joined it. > philosophically speaking I believe that 2.4 has given enough to the > big-end guys. I hope that in 2.5, more emphasis and kernel developer > talent will be devoted to the other 99.99% of Linux users. Better > device support, plug-n-play, manageability, upgradability, etc. Linux > seems to be becoming more and more a server OS lately and I'd like to > see that turned around. > > Of course the three-letter corps need the scalability. Good luck to > them and thanks for supporting Linux. Andrew, I am basically in agreement with you on this. But, realistically: do you think these three-letter corps care about anything that doesn't serve them? I don't see SGI trying to help make Linux more user-friendly, or even the justification from their corporate perspective. This also is going to affect all other "traditional" Linux companies. In fact it might be already. It is ok to let them serve their corporate interests as long as they help Linux. It is healthy _as long as_ there are no competing goals at the kernel level, with one of them getting into the kernel. Competing goals will result in Linux forks. This might not be the case for user space (Gnome vs KDE) unless the case involves exposing some APIs from the kernel. > For the privateers, yes, it's > *fun* to make Linux faster and it is gratifying, but we need to be > aware that it is also *easy*. Solving the problems which are faced by > the wider community of Linux users is going to be dull, and hard. > I am not sure if *easy* is the correct description here. 
Fun, yes. That is why I participate (and maybe you as well). And if it is fun, by definition it means I do what I like. I think a combination of people like us results in an overall improved Linux as long as there are not too many overlaps. And even overlaps might not be a bad idea if you have plenty of time. I find Gnome development boring, dull, and hard (not that I can't do it if you pointed a gun at me). I am sure the Gnome people think the same about what I do. cheers, jamal |
From: Tim W. <ti...@sp...> - 2000-12-25 00:16:40
|
Hi Jamal, Thanks for the info and the paper pointer. This looks very interesting. I've added a couple of comments below... On Sun, Dec 24, 2000 at 12:03:27PM -0500, jamal wrote: [...] > > I think some load balancing heuristic is needed. I think the heuristic > should _not_ be based on counting packets only but rather on CPU load. > For example, if your IDE is generating a lot of interrupts then you want to > take this into account as well. As well, if a CPU is running a lot of user > processes you need to take into account those issues. > That's the rationale/approach we took in DYNIX/ptx and it certainly works well with large numbers of processors and high loads. > But definitely some form of dynamic IRQ routing is a good idea. Sounds > like a very exciting project (and a conference paper). From my > conversation with Gerrit they seem to have solved this. My knowledge of the > APIC is very sparse and I don't have time at the moment. > Dynamic IO-APIC reprogramming, if it can be done very efficiently, is a > definite win. But like I said I don't have the knowledge there and I > believe in numbers. > Seems Gerrit and co had some scheme of figuring which CPU is least loaded > and handing the interrupt to it. The scheme is that you tell the APIC what your current priority is. The APIC has a task priority register, but Linux doesn't use it. We just set it to accept-all at boot time and leave it alone. If you use it to indicate your current priority, the APIC bus will deliver the interrupt to the least-loaded CPU. The RR behaviour (yes I'm a Brit :-) happens if there's a choice of "least loaded". > > > Last time I looked, Alphas didn't have APICs. We need to design a > > sensible architecture-neutral interrupt bonding API (or at least a > > queryable one) before we run off making x86-specific changes. > > How is IRQ affinity achieved on Alphas then? ;-> I thought this was a > common thing that comes in the PCI package. 
And if they can't do IRQ > affinity, I would say they deserve to miss this boat as well. > I don't know what the interrupt controller architecture is on the Alpha, but I suspect it isn't primitive. > > > > As a footnote, and I know this won't be a popular view on lse-tech - > > BTW, what is this LSE list? I just joined it. > It's the (a?) "Linux Scalability Effort". In fact, I'm about to post a roadmap document which tries to give a better feel for what we wanted to achieve here. In a nutshell, it's "what can we do to make Linux run better on big iron, *without* breaking it on uni-proc/embedded systems". > > philosophically speaking I believe that 2.4 has given enough to the > > big-end guys. I hope that in 2.5, more emphasis and kernel developer > > talent will be devoted to the other 99.99% of Linux users. Better > > device support, plug-n-play, manageability, upgradability, etc. Linux > > seems to be becoming more and more a server OS lately and I'd like to > > see that turned around. > > > > Of course the three-letter corps need the scalability. Good luck to > > them and thanks for supporting Linux. > There are people in the Linux Technology Center at IBM working on a number of the above. It just so happens that people like Gerrit and myself work for the company formerly known as Sequent (we were bought by IBM), and hence our area of expertise is large-scale SMP and NUMA systems. > > For the privateers, yes, it's > > *fun* to make Linux faster and it is gratifying, but we need to be > > aware that it is also *easy*. Solving the problems which are faced by > > the wider community of Linux users is going to be dull, and hard. > > > > I am not sure if *easy* is the correct description here. Fun, yes. That is > why I participate (and maybe you as well). And if it is fun, by definition > it means I do what I like. I think a combination of people like us results > in an overall improved Linux as long as there are not too many overlaps. 
> And even overlaps might not be a bad idea if you have plenty of time. > I find Gnome development boring, dull, and hard (not that I can't do it if > you pointed a gun at me). I am sure the Gnome people think the same about > what I do. > Indeed. I'm unconvinced that working on the kernel is easy, at least not in comparison to programming in userland. There are people who are scared witless at the thought of kernel work, but are totally at home (and highly skilled) working on applications. I do think there is a great deal more work to be done in userland than in the kernel, and I sincerely hope that enough people can be found to do it. It's just not my area of expertise. Regards, Tim -- Tim Wright - ti...@sp... or ti...@ar... or tw...@us... IBM Linux Technology Center, Beaverton, Oregon "Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI |
From: jamal <ha...@cy...> - 2000-12-25 03:00:22
|
On Sun, 24 Dec 2000, Tim Wright wrote: > The scheme is that you tell the APIC what your current priority is. The APIC > has a task priority register, but Linux doesn't use it. We just set it to > accept-all at boot time and leave it alone. If you use it to indicate your > current priority, the APIC bus will deliver the interrupt to the least-loaded > CPU. The RR behaviour (yes I'm a Brit :-) happens if there's a choice of > "least loaded". So it seems that the process is capable of setting a high enough priority such that a (hardware) interrupt won't run on a specific CPU? My goal might be slightly different from this. I would like to route interrupts to CPUs which have the "least load of interrupts" in addition to process load. Do you have a paper on this? Maybe you can already do this. cheers, jamal |
From: Tim W. <ti...@sp...> - 2000-12-25 08:59:45
|
On Sun, Dec 24, 2000 at 09:59:34PM -0500, jamal wrote: > > > On Sun, 24 Dec 2000, Tim Wright wrote: > > > The scheme is that you tell the APIC what your current priority is. The APIC > > has a task priority register, but Linux doesn't use it. We just set it to > > accept-all at boot time and leave it alone. If you use it to indicate your > > current priority, the APIC bus will deliver the interrupt to the least-loaded > > CPU. The RR behaviour (yes I'm a Brit :-) happens if there's a choice of > > "least loaded". > > So it seems that the process is capable of setting a high enough priority > such that a (hardware) interrupt won't run on a specific CPU? No, all user processes are pre-emptible. DYNIX/ptx doesn't support the notion of RT processes. The range of user priorities is crunched down into 4 bits. Locking out of interrupts below a given SPL is achieved by programming a different register IIRC, although it is also mirrored in the TPR (task priority register), and SPL1 is "higher" priority than any user process. > My goal might be slightly different from this. I would like to route > interrupts to CPUs which have the "least load of interrupts" in addition > to process load. > Do you have a paper on this? Maybe you can already do this. > That is what we do. I don't know if we have a paper, but it sounds like we should. It's actually quite likely that I can publish the relevant code with little difficulty. Regards, Tim -- Tim Wright - ti...@sp... or ti...@ar... or tw...@us... IBM Linux Technology Center, Beaverton, Oregon "Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI |
From: jamal <ha...@cy...> - 2000-12-25 16:14:05
|
On Mon, 25 Dec 2000, Tim Wright wrote: > On Sun, Dec 24, 2000 at 09:59:34PM -0500, jamal wrote: > > > > > > So it seems that the process is capable of setting a high enough priority > > such that a (hardware) interrupt won't run on a specific CPU? > > No, all user processes are pre-emptible. DYNIX/ptx doesn't support the notion > of RT processes. The range of user priorities is crunched down into 4 bits. > Locking out of interrupts below a given SPL is achieved by programming a > different register IIRC, although it is also mirrored in the TPR (task > priority register), and SPL1 is "higher" priority than any user process. I see. I think I understand what Andi was saying earlier. You are generally proposing some BSDish SPL* scheme which also applies to hardware interrupts(?), i.e. it looks to me as if, with your idea, you might lower the priority of a hardware interrupt. Seems like our goals might intersect somewhere but not as simply as I thought earlier. In essence, I am looking for some way to define sophisticated interrupt scheduling via a policy. Simple example of what I'd like to experiment with: A policy such as: "IRQ3 and IRQ5 are mutually exclusive". The mechanism should then ensure that at any point in time if IRQ3 and IRQ5 are both raised, they get routed to two different CPUs. > That is what we do. I don't know if we have a paper, but it sounds like we > should. It's actually quite likely that I can publish the relevant code > with little difficulty. > Code would help. cheers, jamal |
From: Tim W. <ti...@sp...> - 2000-12-25 17:54:55
|
On Mon, Dec 25, 2000 at 11:13:11AM -0500, jamal wrote: > > > On Mon, 25 Dec 2000, Tim Wright wrote: > > > On Sun, Dec 24, 2000 at 09:59:34PM -0500, jamal wrote: > > > > > > > > > So it seems that the process is capable of setting a high enough priority > > > such that a (hardware) interrupt won't run on a specific CPU? > > > > No, all user processes are pre-emptible. DYNIX/ptx doesn't support the notion > > of RT processes. The range of user priorities is crunched down into 4 bits. > > Locking out of interrupts below a given SPL is achieved by programming a > > different register IIRC, although it is also mirrored in the TPR (task > > priority register), and SPL1 is "higher" priority than any user process. > > I see. I think I understand what Andi was saying earlier. You are > generally proposing some BSDish SPL* scheme which also applies to > hardware interrupts(?), i.e. it looks to me as if, with your idea, you might > lower the priority of a hardware interrupt. The SPL stuff predates BSD by a long way. It was in 6th edition and probably earlier than that. I wouldn't say that we "lower" the priority, but we do have a hierarchy such that there are lower priority interrupt routines that can be pre-empted by higher priority ones. However, it is not necessary to go to these lengths to take advantage of the APIC priority stuff. It is nowhere written in stone that there have to be eight interrupt priority levels. If we were to choose two, for instance, that collapses down to map simply to what Linux uses today, except that by replacing cli/sti we would achieve sane interrupt distribution. > Seems like our goals might > intersect somewhere but not as simply as I thought earlier. > In essence, I am looking for some way to define sophisticated interrupt > scheduling via a policy. > I suspect we're closer together than you might think. The SPL hierarchy thing may turn out to be less valuable than the complexity justifies. The first thing to do is to change the APIC programming. 
> Simple example of what I'd like to experiment with: > A policy such as: "IRQ3 and IRQ5 are mutually exclusive". The mechanism > should then ensure that at any point in time if IRQ3 and IRQ5 are both raised, > they get routed to two different CPUs. > I'd need to think about that. I'm not sure how practical that is on an x86. Of course, the way it's done in ptx would guarantee that provided that at least one CPU in the system wasn't already busy in an interrupt routine. > > That is what we do. I don't know if we have a paper, but it sounds like we > > should. It's actually quite likely that I can publish the relevant code > > with little difficulty. > > > > Code would help. > Will work on it.... Watch this space :-) Regards, Tim -- Tim Wright - ti...@sp... or ti...@ar... or tw...@us... IBM Linux Technology Center, Beaverton, Oregon "Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI |
From: John W. <jw...@en...> - 2000-12-26 20:55:28
|
On Sun, Dec 24, 2000 at 12:03:27PM -0500, jamal wrote: > On Sun, 24 Dec 2000, Andrew Morton wrote: > > As a footnote, and I know this won't be a popular view on lse-tech - > > philosophically speaking I believe that 2.4 has given enough to the > > big-end guys. I hope that in 2.5, more emphasis and kernel developer > > talent will be devoted to the other 99.99% of Linux users. Better > > device support, plug-n-play, manageability, upgradability, etc. Linux > > seems to be becoming more and more a server OS lately and I'd like to > > see that turned around. > > > > Of course the three-letter corps need the scalability. Good luck to > > them and thanks for supporting Linux. > > Andrew, I am basically in agreement with you on this. But, realistically: > do you think these three-letter corps care about anything that > doesn't serve them? I don't see SGI trying to help make Linux more > user-friendly, or even the justification from their corporate perspective. This > also is going to affect all other "traditional" Linux companies. In fact > it might be already. It is ok to let them serve their corporate interests > as long as they help Linux. It is healthy _as long as_ there are no > competing goals at the kernel level, with one of them getting into the > kernel. Competing goals will result in Linux forks. This might not be the > case for user space (Gnome vs KDE) unless the case involves exposing some > APIs from the kernel. Since I'm from one of the three-lettered corps, I'll make a short statement here. First of all, this project isn't just about big iron. It is also about big workloads. Meaning that we are interested not only in large system configurations but smaller systems that are heavily loaded. For SGI, we sell at both ends of this spectrum. 
There are two other places that this project needs to be considerate of: 1) interactive systems (I think the desktop systems fall into here, as they normally aren't doing much but need interactive response time from the apps) and 2) embedded systems. I have plenty of people internally and externally that will kill me if I hurt #1. Also, I use Linux on all my personal systems, so I would kind of be shooting myself in the foot. But my opinion is that the interactivity issues won't be too hard... They usually align with making the big workloads run well. The embedded issue means that we need to pay careful attention in our work to keeping the size of structures down. This doesn't necessarily mean that embedded issues are at odds with scalability... Usually you can be clever and make your code work nicely on minimal configurations and use tunables to work on larger systems... It is just something that many companies would rather just answer with "buy more memory", but with the embedded issue, we won't be able to get away with that. As far as SGI being interested in usability, manageability, user-friendliness, etc.: We don't have the amount of people working on desktop/apps as we once had... that being said, we do have a few resources working on things like KDE/GNOME/OpenOffice/Mozilla. Some of this is porting to IRIX but some is also general development. But I would say you are right in that we do not have a business justification to lead those development efforts... but we will contribute where we can. But for system usability, manageability, etc., we do have a strong interest there. And I think we are working in those directions. And for device support, we have contributed to drivers for the equipment that we ship. I don't think any vendor can do any more than that. jwright - John Wright - SGI | Email: jw...@en... Scalable Linux Manager | Voice: (650) 933-8899 1200 Crittenden Lane MS:30-3-500 | Pager: (650) 254-9296 Mountain View, CA 94043 | Alpha page: jwr...@pa... |
From: Gerrit H. <ge...@us...> - 2000-12-23 00:42:03
|
Andi Kleen wrote: > I think the main difference from this approach to yours is that it tries > to minimize cross CPU traffic and not only load on the CPUs (assuming that > the CPUs are much faster than the bus). The scheme could probably be extended to > other IO too, but it is not clear if it is worth as much there because block IO > tends to have much less state than networking. > > -Andi Andi, in addition to CPU affinity in DYNIX/ptx, we put some effort into identifying data that would normally be shuttled between CPUs and using various techniques to avoid shared data updates. For instance, in the TCP, IP and scheduler subsystems, we kept per-CPU statistics, which are one of the more frequently updated pieces of normally global data. When necessary, or periodically, those per-CPU statistics would be summed into a global structure. We also used a read-copy-update lock to avoid a lock round trip on data reads for things like the IP routing table, limiting writes to any global data (lock) location. In the block IO space (e.g. SCSI, Fibrechannel), we typically had per-controller data structures. On a NUMA system, those per-controller structures resided in memory on that node, and were almost exclusively accessed/updated by that node. This avoided quite a bit of cross-node traffic. Also in the block IO space, we used NUMA-aware IO scheduling and a multipath IO fabric (usually at least one path to each block device from each node) and ensured that IO traffic was rarely sent across a node boundary (avoiding the NUMA interconnect). We did not worry directly about ensuring that a single processor was bound to a single controller (e.g. we could have 30 processors on our SMP system, with 4 NICs, 8 SCSI controllers - how would you do "fair" binding, even manually?) and instead focused on keeping cache line utilization effective. 
Specifically, this included making sure that cache lines either contained all read-only data, or that data would be accessed by only one CPU at a time because of some higher level locking. As a result, using the old Sequent SLIC (grandfather to the APIC) and later the APIC/IO-APIC to set priority levels helped ensure that the system scaled smoothly and that interrupt latency did not increase simply because another interrupt context or locking context held a specific CPU with cli. I've also been reviewing some of Jamal Hadi-Salim's work with NICs on 2 processor systems. His results seem to indicate that APIC binding to a CPU *does* throttle throughput even on a two processor system. We are hoping to work with him to verify that theory sometime during the next few months. However, as the number of CPUs increases to 3, 4, or more, we are reasonably sure, based on our experiences with DYNIX/ptx, that setting the APIC priority value will have a substantial benefit over both the status quo and interrupt binding. gerrit |
From: Andi K. <ak...@su...> - 2000-12-23 04:47:47
|
On Fri, Dec 22, 2000 at 04:41:52PM -0800, Gerrit Huizenga wrote: > > For instance, in the TCP, IP and scheduler subsystems, we kept per-CPU > statistics, which are one of the more frequently updated pieces of > normally global data. When necessary, or periodically, those per-CPU > statistics would be summed into a global structure. Linux 2.4 does exactly this already ;) There is just no global structure; all statistics users have to collect from all the CPUs. The 2.4 networking architecture -- softnet -- tries very hard to keep all data CPU local. We even (probably re)invented things like a big reader lock for common locks like the ip protocol list lock, which is a read-write lock where each CPU has its own reader lock and the very infrequent writer has to acquire all of them. So far it seems to be successful; for up to 8 CPUs it scales near linearly for some tests at least (more CPUs have not been tried). > We also used a read-copy-update lock to avoid a lock round trip on > data reads for things like the IP routing table, limiting writes to > any global data (lock) location. That's currently one of the remaining bottlenecks in networking for local traffic. Linux allocates a dst_entry for every incoming packet, acquiring the lock of its hash bucket and increasing a reference count. Unfortunately for local traffic all the packets tend to bang on the same dst_entry for your local IP, causing cross CPU traffic for the refcnt and the lock. Giving each NIC its own IP would be faster here, assuming the IPs do not collide in the routing cache hash table. Linux is more optimized for routing than for local traffic here, which is probably wrong. Who needs MP routers anyway? > > In the block IO space (e.g. SCSI, Fibrechannel), we typically had per- > controller data structures. On a NUMA system, those per-controller > structures resided in memory on that node, and were almost exclusively > accessed/updated by that node. This avoided quite a bit of cross-node > traffic. 
I'm sure the block IO and SCSI layers could use much improvement here; they're not very scalable right now. > Also in the block IO space, we used NUMA-aware IO scheduling and a > multipath IO fabric (usually at least one path to each block device > from each node) and ensured that IO traffic was rarely sent across > a node boundary (avoiding the NUMA interconnect). What exactly was scheduled? It sounds like a similar problem to the dynamic NIC-IRQ binding. > > We did not worry directly about ensuring that a single processor was > bound to a single controller (e.g. we could have 30 processors on our > SMP system, with 4 NICs, 8 SCSI controllers - how would you do "fair" > binding, even manually?) and instead focused on keeping cache line > utilization effective. Specifically, this included making sure that > cache lines either contained all read-only data, or that data would > be accessed by only one CPU at a time because of some higher level > locking. That's the goal with the dynamic IRQ-NIC binding. It's just that in 2.4 the only way for an administrator to tune would be static binding; hopefully 2.5 can offer a dynamic solution. > I've also been reviewing some of Jamal Hadi-Salim's work with NICs > on 2 processor systems. His results seem to indicate that APIC > binding to a CPU *does* throttle throughput even on a two processor > system. We are hoping to work with him to verify that theory > sometime during the next few months. However, as the number of > CPUs increases to 3, 4, or more, we are reasonably sure, based on > our experiences with DYNIX/ptx, that setting the APIC priority > value will have a substantial benefit over both the status quo > and interrupt binding. Interesting. I'm looking forward to seeing your numbers. -Andi |
From: Andi K. <ak...@su...> - 2000-12-21 22:16:41
|
On Thu, Dec 21, 2000 at 01:24:16PM -0800, Tim Wright wrote: > Hi Andi, > sounds like we should be talking. What you say is exactly in line with some > work we have planned here. We're currently looking to prototype use of > APIC priority programming here in Beaverton. > > Simply put, in DYNIX/ptx we always set the local APIC priority to indicate > what the processor is doing. Doing this means that interrupts always get > delivered to the least loaded (preferably idle) CPU. Could you expand a bit on what these states are in DYNIX? Is it just like BSD spl levels, or more complicated? -Andi |
From: Tim W. <ti...@sp...> - 2000-12-21 22:37:54
|
Sorry, yes. Just like the traditional Unix interrupt priority levels. In DYNIX/ptx at least, we even have the same number of levels, viz. SPL0 through SPL7 (aka splhi :-). Obviously, any spinlocks that can be used in interrupt context have to disable interrupts at the local CPU to prevent a livelock when the interrupt routine happens to land on a CPU that already has the lock. We maintained the interrupt priority level hierarchy because it allows you greater flexibility in who gets the best latency. Again, it might be argued that this introduces needless complexity, but it would be interesting to test and see. Tim On Thu, Dec 21, 2000 at 11:16:33PM +0100, Andi Kleen wrote: > On Thu, Dec 21, 2000 at 01:24:16PM -0800, Tim Wright wrote: > > Hi Andi, > > sounds like we should be talking. What you say is exactly in line with some > > work we have planned here. We're currently looking to prototype use of > > APIC priority programming here in Beaverton. > > > > Simply put, in DYNIX/ptx we always set the local APIC priority to indicate > > what the processor is doing. Doing this means that interrupts always get > > delivered to the least loaded (preferably idle) CPU. > > Could you expand a bit on what these states are in DYNIX? Is it just like > BSD spl levels, or more complicated? > > -Andi -- Tim Wright - ti...@sp... or ti...@ar... or tw...@us... IBM Linux Technology Center, Beaverton, Oregon "Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI |
From: Andi K. <ak...@su...> - 2000-12-21 22:51:42
|
On Thu, Dec 21, 2000 at 02:37:39PM -0800, Tim Wright wrote: > Sorry, yes. > Just like the traditional Unix interrupt priority levels. In DYNIX/ptx at least, > we even have the same number of levels, viz. SPL0 through SPL7 (aka splhi :-). > > Obviously, any spinlocks that can be used in interrupt context have to disable > interrupts at the local CPU to prevent a livelock when the interrupt routine > happens to land on a CPU that already has the lock. We maintained the interrupt > priority level hierarchy because it allows you greater flexibility in who > gets the best latency. Again, it might be argued that this introduces > needless complexity, but it would be interesting to test and see. I doubt that there would be many chances to get generally visible spl levels into Linux -- Torvalds et al. have a near religious aversion to them. I am also not sure I understand how the spl level is related to the load of the CPU. Do you have a special SPL level for the idle thread? I suspect you could just set the APIC in the idle thread in Linux, to redirect interrupts to idle CPUs, but this would only be a win when the interrupt is more costly than the cost of transferring the context changed by the CPU to the final CPU that runs the consumer. Is that true for all the NUMA/SMP boxes? -Andi |
From: Tim W. <ti...@sp...> - 2000-12-22 01:13:17
|
On Thu, Dec 21, 2000 at 11:51:39PM +0100, Andi Kleen wrote:
> I doubt that there would be many chances to get generally visible spl
> levels into Linux -- Torvalds et al. have a near religious aversion
> against them.
>
Yes, I sort of gathered that. I've not seen any explanation of this antipathy. As I say, it may turn out to be of little or no use given the added complexity, but I think it would be better to find that out rather than assume it :-)

> I am also not sure I understand how the spl level is related to the load of
> the CPU. Do you have a special SPL level for the idle thread?
>
No, not as such. If you're running at elevated interrupt priority, the APIC TPR (Task Priority Register) is set to a corresponding value (higher than any values valid at SPL0). If you are running outside of interrupts, however, the scheduler records the priority in the low 4 bits of the TPR via a routine called apic_pri(). We actually have to crunch 128 levels down to 16, but it still means that the idle or less busy CPUs are more likely to be interrupted.

> I suspect you could just set the APIC in the idle thread in Linux, to redirect
> interrupts to idle CPUs, but this would only be a win when the interrupt
> is more costly than the cost of transferring the context changed by the CPU
> to the final CPU that runs the consumer.
> Is that true for all the NUMA/SMP boxes?
>
That's our assumption. Of course, it may be wrong. It is worth considering that a number of these design decisions were made quite a few years ago, and that processor clock speed/cache speed/main memory speed/disk speed ratios have not remained constant in the interim. At least we'd like to look and see if there's a "better" way of doing things for larger systems.

Tim

--
Tim Wright - ti...@sp... or ti...@ar... or tw...@us...
IBM Linux Technology Center, Beaverton, Oregon
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI |
From: Andi K. <ak...@su...> - 2000-12-22 01:38:36
|
On Thu, Dec 21, 2000 at 05:13:08PM -0800, Tim Wright wrote:
> On Thu, Dec 21, 2000 at 11:51:39PM +0100, Andi Kleen wrote:
> > I doubt that there would be many chances to get generally visible spl
> > levels into Linux -- Torvalds et al. have a near religious aversion
> > against them.
> >
> Yes, I sort of gathered that. I've not seen any explanation of this antipathy.

One argument is e.g.
http://www.usenix.org/publications/library/proceedings/ana97/full_papers/small/small.html

-Andi |