From: Davide L. <da...@xm...> - 2002-10-29 20:54:06
|
On Tue, 29 Oct 2002, John Gardiner Myers wrote:
> > I bet Davide knows best.
>
> Nope, he doesn't.

It is very easy for me to remain calm here. You're a funny guy. You've been
in computer science for many, many years and still you're not able to
understand how edge-triggered events work. And look, this applies to every
field, from EE to CS. Book suggestions would be requested here, but since I
believe digging around a technical library to be pretty fun, I'll leave you
this pleasure.

> > An easy solution is to have sys_epoll_ctl check if there is data ready
> > and make sure there is an edge to report in that case to the next call
> > of sys_epoll_ctl().

This is the very solution I am proposing. This is an example snippet of code
that can be used with the current API:

	for (;;) {
		nfds = sys_epoll_wait(kdpfd, &pfds, -1);
		for (n = 0; n < nfds; ++n) {
			if ((fd = pfds[n].fd) == s) {
				client = accept(s, (struct sockaddr *) &local, &addrlen);
				if (client < 0) {
					perror("accept");
					continue;
				}
				if (sys_epoll_ctl(kdpfd, EP_CTL_ADD, client, POLLIN | POLLOUT) < 0) {
					fprintf(stderr, "sys_epoll set insertion error: fd=%d\n", client);
					return -1;
				}
				fd = client;
			}
			do_use_fd(fd);
		}
	}

This is what would be used in the case of your
failing-to-understand-edge-triggered-API method:

	for (;;) {
		nfds = sys_epoll_wait(kdpfd, &pfds, -1);
		for (n = 0; n < nfds; ++n) {
			if ((fd = pfds[n].fd) == s) {
				client = accept(s, (struct sockaddr *) &local, &addrlen);
				if (client < 0) {
					perror("accept");
					continue;
				}
				if (sys_epoll_ctl(kdpfd, EP_CTL_ADD, client, POLLIN | POLLOUT) < 0) {
					fprintf(stderr, "sys_epoll set insertion error: fd=%d\n", client);
					return -1;
				}
			} else
				do_use_fd(fd);
		}
	}

Why the heck (and this for the 100th time) do you want to go wait for an
event on the newly born fd if:

1) On connect() you have the _full_ write I/O space available
2) On accept() it's very likely that you'll find something more than a SYN
   in the first packet

Besides, the first code is even cleaner and more symmetric, while adopting
your half *ss solution might suggest to the user that he can go waiting for
events any time he wants. Like going to sleep on the wait queue of an IDE
disk w/out having issued any command. Now to bring this to 101, consider:

1) "issuing a command to an IDE disk" == "using read/write until EAGAIN"
2) "adding yourself on the IDE disk wait queue" == "calling sys_epoll_wait()"

PS: since my time is not infinite, and since I'm working on the changes we
agreed on with Andrew, I would suggest you either take another look at the
code, suggesting new changes to us (like you did yesterday), or go shopping
for books.

- Davide

|
From: Jamie L. <lk...@ta...> - 2002-10-30 00:26:55
|
> 1) "issuing a command to an IDE disk" == "using read/write until EAGAIN"
> 2) "adding yourself on the IDE disk wait queue" == "calling sys_epoll_wait()"

That is quite a good analogy. epoll is like a waitqueue - which is also
like a futex. To use a waitqueue properly you have to do these things in
the order shown:

1. Set the task state to stopped.
2. Register yourself on the waitqueue.
3. Check the condition.
4. If the condition is not met, schedule.

With epoll it is very similar. To wait for a condition on a file
descriptor, such as readability, you must do these things in the order
shown:

1. Register your interest using epoll_ctl.
2. Check the condition by actually calling read().
3. If the condition is not met (i.e. read() returned EAGAIN), call
   epoll_wait (i.e. the equivalent of schedule).

With epoll, you can optimise by registering interest just once. In other
words, steps 2 and 3 may be repeated without repeating step 1.

And if you are concerned about starvation -- that is, one of your file
descriptors always has new data so others don't get a chance to be
serviced -- don't be. You don't have to completely read one fd until you
see EAGAIN. All that matters is that until you see the EAGAIN, your user
space data structure should have a flag that says the fd is still
readable, so another epoll event is not expected or required for that fd.

-- Jamie

|
From: Davide L. <da...@xm...> - 2002-10-30 02:00:41
|
On Wed, 30 Oct 2002, Jamie Lokier wrote:
> That is quite a good analogy. epoll is like a waitqueue - which is
> also like a futex.
> [...]
> All that matters is that until you see the EAGAIN, your user space
> data structure should have a flag that says the fd is still readable,
> so another epoll event is not expected or required for that fd.

Jamie, can I pay you a beer? Your comment describes the API perfectly. You
can replace read() with write() in your description, and the whole thing is
still true.

- Davide

|
From: John G. M. <jg...@ne...> - 2002-10-30 02:22:42
|
Davide Libenzi wrote:
> You're in the computer science by many many years and still you're not
> able to understand how edge triggered events works.

Failure to agree does not imply failure to understand. I understand the
model you want to apply to this problem; I do not agree that it is the best
model to apply to this problem.

> Why the heck ( and this for the 100th time ) do you want to go to wait
> for an event on the newly born fd if :
>
> 1) On connect() you have the _full_ write I/O space available
> 2) On accept() it's very likely the you'll find something more than a SYN
>    in the first packet
>
> Besides, the first code is even more cleaner and simmetric, while
> adopting your half *ss solution might suggest the user that he can go
> waiting for events any time he wants.

The first code is hardly cleaner and is definitely not symmetric--the way
the accept code has to set up to fall through the do_use_fd() code is
subtle. In the first code, the accept segment cannot be cleanly pulled into
a callback:

	for (;;) {
		nfds = sys_epoll_wait(kdpfd, &pfds, -1);
		for (n = 0; n < nfds; ++n) {
			(cb[pfds[n].fd])(pfds[n].fd);
		}
	}

Also, your first code does not fit your "edge triggered" model--the code
for handling 's' does not drain its input. By the time you call accept(),
there could be multiple connections ready to be accepted.

Your connect() argument is not applicable to "server sends first"
protocols. I suspect you are being overly optimistic about the likelihood
of getting data with SYN, but whatever. The argument is basically that not
delivering an event upon registration (and thus having the event be
implicit) improves performance because the socket is going to be ready with
sufficiently high probability. I would counter that the cost of explicitly
delivering such an event is minuscule compared to the rest of the cost of
connection setup and teardown--the optimization is not worthwhile.

> Like going to sleep the the wait queue of IDE disk w/out having issued
> any command.

The key difference between this interface and wait queues is that with wait
queues it is not technically feasible to both register interest and test
the condition in a single, atomic operation. epoll does not have this
technical limitation, so it can provide a better interface.

> PS: since my time is not infinite, and since I'm working on the changes
> we agreed with Andrew I would suggest you either to take another look at
> the code suggesting us new changes ( like you did yesterday ) or to go
> shopping for books.

I am uncomfortable with the way the epoll code adds its own set of
notification hooks into the socket and pipe code. Much better would be to
extend the existing set of notification hooks, like the aio poll code does.
That would reduce the risk of kernel bugs where some subsystem delivers an
event to one but not all types of poll notification hooks, and it would
minimize the cost of the epoll patch when epoll is not being used.

|
From: Davide L. <da...@xm...> - 2002-10-30 03:42:27
|
On Tue, 29 Oct 2002, John Gardiner Myers wrote:
> Failure to agree does not imply failure to understand. I understand the
> model you want to apply to this problem, I do not agree that it is the
> best model to apply to this problem.

John, your first post about epoll was "the interface has a bug, please do
not merge it". Now either you have a strange way to communicate
disagreement, or it is something more than that. Or maybe you wanted to
just blindly kill the interface with your comments because you're totally
committed to another one currently, and the existence of an interface that
might work as well, or maybe better in some cases, could create some
problem for you that I'm unaware of.

> The first code is hardly cleaner and is definitely not symmetric--the
> way the accept code has to set up to fall through the do_use_fd() code
> is subtle. In the first code, the accept segment cannot be cleanly
> pulled into a callback:
>
> 	for (;;) {
> 		nfds = sys_epoll_wait(kdpfd, &pfds, -1);
> 		for (n = 0; n < nfds; ++n) {
> 			(cb[pfds[n].fd])(pfds[n].fd);
> 		}
> 	}

Sorry, what prevents you from coding that? If you, instead of ranting
because epoll does not fit your personal idea of event notification, took a
look at the example http server used for the test (coroutine based), you'd
see that it does exactly that. Ok, it's a mess because it supports 5
interfaces, all #ifdef'ed, but the concept is there.

> Also, your first code does not fit your "edge triggered" model--the
> code for handling 's' does not drain its input. By the time you call
> accept(), there could be multiple connections ready to be accepted.

I really don't believe this. Are you just trolling or what? It is clear
that your acceptor routine has to do a little more work than that in a real
program. Again, looking at the example http server might help you. This is
what the acceptor coroutine does in such a _trivial_ http server:

	static void *dph_acceptor(void *data)
	{
		struct dph_conn *conn = (struct dph_conn *) data;
		struct sockaddr_in addr;
		int sfd, addrlen = sizeof(addr);

		while ((sfd = dph_accept(conn, (struct sockaddr *) &addr, &addrlen)) != -1) {
			if (dph_new_conn(sfd, dph_httpd) < 0) {
				dph_close(sfd);
			}
		}
		return data;
	}

and this is dph_accept:

	int dph_accept(struct dph_conn *conn, struct sockaddr *addr, int *addrlen)
	{
		int sfd, flags = 1;

		while ((sfd = accept(conn->sfd, addr, (socklen_t *) addrlen)) < 0) {
			if (errno == EINTR)
				continue;
			if (errno != EAGAIN && errno != EWOULDBLOCK)
				return -1;
			conn->events = POLLIN;
			co_resume(conn);
		}
		if (ioctl(sfd, FIONBIO, &flags) &&
		    ((flags = fcntl(sfd, F_GETFL, 0)) < 0 ||
		     fcntl(sfd, F_SETFL, flags | O_NONBLOCK) < 0)) {
			close(sfd);
			return -1;
		}
		return sfd;
	}

and this is dph_new_conn:

	static int dph_new_conn(int sfd, void *func)
	{
		struct dph_conn *conn = (struct dph_conn *) malloc(sizeof(struct dph_conn));

		if (!conn)
			return -1;
		DBL_INIT_LIST_HEAD(&conn->lnk);
		conn->sfd = sfd;
		conn->events = POLLIN | POLLOUT;
		conn->revents = 0;
		if (!(conn->co = co_create(func, NULL, stksize))) {
			free(conn);
			return -1;
		}
		DBL_LIST_ADDT(&conn->lnk, &chash[sfd % chash_size]);
		if (epoll_ctl(kdpfd, EP_CTL_ADD, sfd, POLLIN | POLLOUT) < 0) {
			DBL_LIST_DEL(&conn->lnk);
			co_delete(conn->co);
			free(conn);
			return -1;
		}
		co_call(conn->co, conn);
		return 0;
	}

Oh ... I forgot the scheduler:

	static int dph_scheduler(int loop, unsigned int timeout)
	{
		int ii, nfds;
		struct dph_conn *conn;
		struct pollfd const *pfds;

		do {
			nfds = sys_epoll_wait(kdpfd, &pfds, timeout * 1000);
			for (ii = 0; ii < nfds; ii++, pfds++) {
				if ((conn = dph_find(pfds->fd))) {
					conn->revents = pfds->revents;
					if (conn->revents & conn->events)
						co_call(conn->co, conn);
				}
			}
		} while (loop);
		return 0;
	}

And just to make it complete, these are read/write:

	int dph_read(struct dph_conn *conn, char *buf, int nbyte)
	{
		int n;

		while ((n = read(conn->sfd, buf, nbyte)) < 0) {
			if (errno == EINTR)
				continue;
			if (errno != EAGAIN && errno != EWOULDBLOCK)
				return -1;
			conn->events = POLLIN;
			co_resume(conn);
		}
		return n;
	}

	int dph_write(struct dph_conn *conn, char const *buf, int nbyte)
	{
		int n;

		while ((n = write(conn->sfd, buf, nbyte)) < 0) {
			if (errno == EINTR)
				continue;
			if (errno != EAGAIN && errno != EWOULDBLOCK)
				return -1;
			conn->events = POLLOUT;
			co_resume(conn);
		}
		return n;
	}

The functions co_resume() and co_call() are the coroutine suspend and call.
The one I'm using is this:

	http://www.goron.de/~froese/coro/

but coroutine implementation is trivial. You could change the same
implementation to use an I/O driven state machine and the result would not
change.

> I am uncomfortable with the way the epoll code adds its own set of
> notification hooks into the socket and pipe code. Much better would be
> to extend the existing set of notification hooks, like the aio poll
> code does. That would reduce the risk of kernel bugs where some
> subsystem fails to deliver an event to one but not all types of poll
> notification hooks and it would minimize the cost of the epoll patch
> when epoll is not being used.

Doh! John, did you actually read the code? Could you compare AIO's level of
intrusion into the kernel code with epoll's? It fits _exactly_ the
rt-signal hooks. One of the design goals for me was to add almost nothing
on the main path. You can look here for a quick comparison between aio poll
and epoll for a test where event delivery efficiency does matter (pipetest):

	http://lse.sourceforge.net/epoll/index.html

Now, I don't believe that a real world app will exchange 300000 tokens per
second through a pipe, but this helps you understand the efficiency of the
epoll event notification subsystem.

- Davide

|
From: John G. M. <jg...@ne...> - 2002-10-31 02:08:11
|
Davide Libenzi wrote:
> John, your first post about epoll was "the interface has a bug, please
> do not merge it".

My first post about epoll pointed out how it was designed for single
threaded callers and concluded:

    I certainly hope /dev/epoll itself doesn't get accepted into the
    kernel, the interface is error prone. Registering interest in a
    condition when the condition is already true should immediately
    generate an event, the epoll interface did not do that last time I
    saw it discussed. This deficiency in the interface requires callers
    to include more complex workaround code and is likely to result in
    subtle, hard to diagnose bugs.

I did not say "the interface has a bug", I said that the interface is error
prone. This is a deficiency that should be fixed before the interface is
added to the kernel.

> Sorry, what prevents you in coding that ? If you, instead of ranting
> because epoll does not fit your personal idea of event notification,
> took a look to the example http server used for the test ( coroutine
> based ) you'll see that does exactly that.

You posted code which you claimed was "even more cleaner and simmetric"
(sic) because it fell through to the do_use_fd() code instead of putting
the do_use_fd() code in an else clause. A callback scheme is akin to the
if/else structure. To adapt the first code to a callback scheme, the accept
callback has to somehow arrange to call the do_use_fd() callback before
returning to the event loop. This requirement is subtle and asymmetric.

> I really don't believe this. Are you just trolling or what ? It is
> clear that your acceptor routine has to do a little more work than that
> in a real program.

Basically, you spawn off another coroutine. That complicates the "fall
through to do_use_fd()" logic in the first code by requiring an external
facility not required by the second code. The second code could simply have
the accept code loop until EAGAIN.

> Doh ! John, did you actually read the code ?

Yes, indeed.

> Could you compare AIO level of intrusion inside the kernel code with
> the epoll one ?

Aio poll extends the existing set of poll notification hooks with a
callback mechanism. It then plugs into this callback mechanism in order to
deliver events. The end result is that the same notification hooks are used
for classic poll and aio poll. When aio poll is not being used, there is no
additional performance penalty other than a slightly larger
poll_table_entry and poll_table_page.

Epoll creates a new callback mechanism and plugs into this new callback
mechanism. It adds a new set of notification hooks which feed into this new
callback mechanism. The end result is that there is one set of notification
hooks for classic poll and another set for epoll. When epoll is not being
used, the poll and socket code makes an additional set of checks to see
that nobody has registered interest through the new callback mechanism.

> It fits _exactly_ the rt-signal hooks. One of the design goals for me
> was to add almost nothing on the main path. You can lookup here for a
> quick compare between aio poll and epoll for a test where events
> delivery efficency does matter ( pipetest ) :

This is a comparison of the cost of using epoll to the cost of using aio in
one particular situation. It is irrelevant to the point I was making.

> Now, I don't believe that a real world app will exchange 300000 tokens
> per second through a pipe, but this help you to understand the efficency
> of the epoll event notification subsystem.

My understanding of the efficiency of the epoll event notification
subsystem is:

1) Unlike the current aio poll, it amortizes the cost of interest
   registration/deregistration across multiple events for a given
   connection.
2) It declares multithreaded use out of scope, making optimizations that
   are only appropriate for use by single threaded callers.

|
From: Davide L. <da...@xm...> - 2002-10-31 03:11:56
|
On Wed, 30 Oct 2002, John Gardiner Myers wrote:
> You posted code which you claimed was "even more cleaner and simmetric"
> (sic) because it fell through to the do_use_fd() code instead of
> putting the do_use_fd() code in an else clause. A callback scheme is
> akin to the if/else structure. To adapt the first code to a callback
> scheme, the accept callback has to somehow arrange to call the
> do_use_fd() callback before returning to the event loop. This
> requirement is subtle and asymmetric.

A callback scheme can be _trivially_ implemented using the current epoll.
I'm sure you know exactly how to do it, so I'm not spending more time
explaining it to you.

> Basically, you spawn off another coroutine. That complicates the "fall
> through to do_use_fd()" logic in the first code by requiring an
> external facility not required by the second code. The second code
> could simply have the accept code loop until EAGAIN.

No it does not, you always fall through do_use_fd(). It's that simple.

> Epoll creates a new callback mechanism and plugs into this new callback
> mechanism. It adds a new set of notification hooks which feed into this
> new callback mechanism. The end result is that there is one set of
> notification hooks for classic poll and another set for epoll. When
> epoll is not being used, the poll and socket code makes an additional
> set of checks to see that nobody has registered interest through the
> new callback mechanism.

Whereas the epoll hooks have nothing to do with ->f_op->poll().

> This is a comparison of the cost of using epoll to the cost of using
> aio in one particular situation. It is irrelevant to the point I was
> making.

See, I believe numbers talk. And they make a pretty clear point indeed.

> My understanding of the efficiency of the epoll event notification
> subsystem is:
>
> 1) Unlike the current aio poll, it amortizes the cost of interest
>    registration/deregistration across multiple events for a given
>    connection.

Yep.

> 2) It declares multithreaded use out of scope, making optimizations
>    that are only appropriate for use by single threaded callers.

It's not single threaded. It can be used in a multithreaded environment if
the one who codes the app has a minimal idea of what he's doing. Like
everything else. You cannot use a FILE* wildly, sharing it randomly inside
a multithreaded app, and expect to receive coherent results. Like 95% of
the APIs. Can those APIs be used in a multithreaded environment? You bet,
with care, like everything that uses freakin' threads.

- Davide

|
From: Suparna B. <su...@in...> - 2002-10-31 11:08:30
|
On Wed, Oct 30, 2002 at 07:21:24PM -0800, Davide Libenzi wrote:
> On Wed, 30 Oct 2002, John Gardiner Myers wrote:
> > Epoll creates a new callback mechanism and plugs into this new
> > callback mechanism. It adds a new set of notification hooks which
> > feed into this new callback mechanism. [...] When epoll is not being
> > used, the poll and socket code makes an additional set of checks to
> > see that nobody has registered interest through the new callback
> > mechanism.
>
> Where epoll hooks has nothing to do with ->f_op->poll()

I think what John means, and what Jamie has also brought up in a separate
note, is that now when an event happens on an fd, in some cases there are
tests for 3 kinds of callbacks that get triggered -- the wait queue for
poll type registrations, the fasync list for sigio, and the new epoll file
send notify type callbacks. There is a little overhead (not sure if
significant) for each kind of test ...

> > 1) Unlike the current aio poll, it amortizes the cost of interest
> >    registration/deregistration across multiple events for a given
> >    connection.
>
> Yep

Adding persistent iocb support to aio doesn't appear too hard, and to be
fair to aio, it does seem to help it come much closer to epoll, in fact
very much closer at least for pipetest with a quickly hacked version that I
tried. There still appears to be a gap remaining to be covered, i.e. epoll
continuing to lead :) albeit by a smaller margin.

A little more magic is going on than just interest registration
amortization (and I suspect it's not just the threading argument), worth
analysing if not for any other reason but to gain a better understanding of
these 2 event delivery mechanisms, the core for both of which are now in
the mainline kernel.

Regards
Suparna

-- 
Suparna Bhattacharya (su...@in...)
Linux Technology Center
IBM Software Labs, India

|
From: Davide L. <da...@xm...> - 2002-10-31 18:32:55
|
On Thu, 31 Oct 2002, Suparna Bhattacharya wrote:
> I think what John means, and what Jamie has also brought up in a
> separate note is that now when an event happens on an fd, in some cases
> there are tests for 3 kinds of callbacks that get triggered -- the wait
> queue for poll type registrations, the fasync list for sigio, and the
> new epoll file send notify type callbacks. There is a little overhead
> (not sure if significant) for each kind of test ...

The poll hooks are not where an edge triggered event notification API
wants to hook. Given the way notifications are sent and the registration
method, that is not the most efficient thing. Hooking inside the fasync
list is worth investigating and I'll look into it as soon as I've finished
the patch for 2.5.45 for Linus. It does have certain limits IMHO, like the
single lock protection. I'll look into it, even if the famous cost of the
extra callback check cannot even be measured IMHO.

- Davide

|
From: Jamie L. <lk...@ta...> - 2002-10-30 23:02:09
|
John Gardiner Myers wrote:
> I am uncomfortable with the way the epoll code adds its own set of
> notification hooks into the socket and pipe code. Much better would be
> to extend the existing set of notification hooks, like the aio poll
> code does.

Fwiw, I agree with the above (I'm having a think about it).

I also agree with criticisms that epoll should test and send an event on
registration, but only _if_ the test is cheap. Nothing to do with
correctness (I like the edge semantics as they are), but because delivering
one event is so infinitesimally low impact with epoll that it's preferable
to doing a single speculative read/write/whatever.

Regarding the effectiveness of the optimisation, I'd guess that quite a lot
of incoming connections do not come with initial data in the short
scheduling time after a SYN (unless it's on a LAN). I don't know this for
sure though.

-- Jamie

|
From: Davide L. <da...@xm...> - 2002-10-30 23:44:14
|
On Wed, 30 Oct 2002, Jamie Lokier wrote:
> I also agree with criticisms that epoll should test and send an event
> on registration, but only _if_ the test is cheap. Nothing to do with
> correctness (I like the edge semantics as they are), but because
> delivering one event is so infinitesimally low impact with epoll that
> it's preferable to doing a single speculative read/write/whatever.
>
> Regarding the effectiveness of the optimisation, I'd guess that quite
> a lot of incoming connections do not come with initial data in the
> short scheduling time after a SYN (unless it's on a LAN). I don't
> know this for sure though.

Ok Jamie, try to explain to me what kind of improvement this first drop
would bring. And also, how such a first drop would not bring "confusion"
for the user, letting him think that he can go sleeping even w/out having
first received EAGAIN. Isn't it better to say "you wait for events after
EAGAIN", instead of "you wait for events after EAGAIN, but also after
accept/connect"? The cost of the test will be basically the cost of a
->poll(), that is exactly the same cost as the very first read()/write()
that you would do by following the current API rule.

- Davide

|
From: Jamie L. <lk...@ta...> - 2002-10-31 00:53:07
|
Davide Libenzi wrote:
> The cost of the test will be basically the cost of a ->poll(), that is
> exactly the same cost of the very first read()/write() that you would
> do by following the current API rule.

No, the cost of ->poll() is somewhat less than read()/write(), because the
latter requires a system call and the former does not. System calls are
still nowhere near as cheap as function calls.

> Ok Jamie, try to explain me which kind of improvement this first drop
> will bring.

I have thought about an optimal server state machine. (I presume from your
carefully thought out implementation that you have too).

In a state machine, each fd has some user-space state. I've already hinted
at how this is used to prevent starvation/livelock on a busy server, and
make service fairer.

I would take that further and _defer_ the epoll_ctl() to register an fd
until the first time I have seen EAGAIN from that fd. This is because in
some cases, epoll_ctl() would not be needed at all - so we can remove that
overhead, and the system call overhead.

Now you would force me to call read() or write() after the epoll_ctl(),
even though I _know_ the result is always going to be EAGAIN. You're
forcing me to make an always-redundant system call. But I can't omit it,
because that's a race condition.

So, I've thought about the _optimal_ state machine and it's clear that
epoll should test the condition on fd registration - for best performance.
(Nothing to do with scalability, just raw performance).

> And also, how such first drop would not bring a "confusion" for the
> user, letting him think that he can go sleeping event w/out having
> first received EAGAIN. Isn't it better to say "you wait for events
> after EAGAIN", instead of "you wait for events after EAGAIN but after
> accept/connect".

Be careful with your rules. epoll should work with blocking fds too, if
you understand the rules well enough, and fd registration doesn't have to
be done at the same time as accept/connect/pipe.

Your current rule in practice is:

    an event is generated on every "would-block" -> "ready" transition.
    after fd registration, you must treat the fd as "ready".

The proposed rule is this:

    an event is generated on every "would-block" -> "ready" transition.
    after fd registration, you may treat the fd as in any state you like.

The proposed rule is better because it permits better optimisations in
user space, as explained earlier. (If you _really_ want to avoid the call
to ->poll() when user space doesn't care, make that a flag argument to
epoll_ctl()).

enjoy :)
-- Jamie

|
From: Davide L. <da...@xm...> - 2002-10-31 04:05:40
|
On Thu, 31 Oct 2002, Jamie Lokier wrote: > No, the cost of ->poll() is somewhat less than read()/write(), because > the latter requires a system call and the former does not. System > calls are still nowhere near as cheap as function calls. Jamie, it's not for the cost, it's that IMHO is useless. And might generate confusion on the API usage. > I have thought about an optimal server state machine. (I presume from > your carefully thought out implementation that you have too). > > In a state machine, each fd has some user-space state. I've already > hinted at how this is used to prevent starvation/livelock on a busy > server, and make service fairer. > > I would take that further and _defer_ the epoll_ctl() to register an > fd until the first time I have seen EAGAIN from that fd. This is > because in some cases, epoll_ctl() would not be needed at all - so we > can remove that overhead, and the system call overhead. > > Now you would force me to call read() or write() after the > epoll_ctl(), even though I _know_ the result is always going to be > EAGAIN. You're forcing me to make an always redundant system call. > But I can't omit it, because that's a race condition. > > So, I've thought about the _optimal_ state machine and it's clear that > epoll should test the condition on fd registration - for best > performance. (Nothing to do with scalability, just raw performance). Jamie I don't force you to call read/write soon. Your state machine will have a state 0, from where everything starts. Let's say that this server is an SMTP server and that supports PIPELINING. When a client connect ( you accept ) you will basically have your acceptor routine that puts the fd for the new connection inside your list of ready-fds. Such list will contain connection status, state machine state and a callback at the bare bone. The whenever you feel it appropriate you pop the fd from the ready list and you call the associated callback. 
That callback for state 0 will have encoded "send SMTP welcome message" to the client. The socket write buffer will be empty and your write() will return != EAGAIN. So you keep your fd inside your ready list. Having a ready list enables you to handle priorities, fairness, etc... Having successfully sent the welcome string will move you to the next state, state 1. Whenever you find it appropriate, you'll call again the callback associated with the file descriptor, which for state 1 will have encoded "read SMTP command". Now suppose that the SMTP client is lazy and you have nothing in the input buffer ( or you partially read the SMTP command ). The read() will return EAGAIN, you remain in state 1 and you remove the fd from your ready list. This guy is _ready_ to generate an event. One of the next times you'll call epoll_wait(2) you'll find our famous fd ready to be used. You push it into the ready list, and it's up to you, based on your fairness policies, to use it soon or not. <b> The important thing is that you keep it in your ready list and you do not go wait for it </b> Now the PIPELINING stuff makes it worthwhile to have your ready-fds list, to apply fairness rules among your clients. The above pattern repeats by moving your state machine among your states until, finally, you reach the final state where you drop the connection. Now, this one, which is a typical state machine implementation, can be _trivially_ implemented with epoll, and I don't see how adding an initial event might help in this design. The other, even more trivial, implementation using coroutines shows its simplicity in a pretty clear way. > Be careful with your rules. epoll should work with blocking fds too, > if you understand the rules well enough, and fd registration doesn't > have to be done at the same time as accept/connect/pipe. Obviously you can register the fd whenever you want. I would take _a_lot_ of care using it with blocking files. 
Not because it will crash or something like that, but because you might stall your app on a read/write operation. Suppose you received your event, and you have 2000 bytes in your input buffer for example. You start reading the data with a blocking file and when the data runs out you'll be waiting on that system call, which is definitely not what you want to do in a 1:N ( one task, N files ) application architecture. You don't really want to use blocking files with an edge triggered event API. > Your current rule in practice is: > > an event is generated on every "would-block" -> "ready" transition. > after fd registration, you must treat the fd as "ready". > > The proposed rule is this: > > an event is generated on every "would-block" -> "ready" transition. > after fd registration, you may treat the fd as in any state you like. > > The proposed rule is better because it permits better optimisations in > user space, as explained earlier. (If you _really_ want to avoid the > call to ->poll() when user space doesn't care, make that a flag > argument to epoll_ctl()). I still prefer 1) Jamie, besides the system call cost ( which is not always a cost, see soon-ready ops ), there's the fact of making the user follow a behavior pattern, something that point 2) leaves uncertain. Now, I guess that we will spend a lot of time arguing and talking about nothing. Let's go to the code. Show me with real code ( possibly not 25000 lines :) ) where you get stuck w/out having the initial event, and if it makes sense and there's no clean way to solve it in user space, I'll seriously consider your ( and John's ) proposal. - Davide |
From: Jamie L. <lk...@ta...> - 2002-10-31 15:07:41
|
Davide Libenzi wrote: > [long description of ready lists] Davide, you have exactly explained ready lists, which I assume we'd already agreed on (cf. beer), and completely missed the point about deferring the call to epoll_ctl(). You haven't mentioned that at all in your description. Consider a caching HTTP proxy server. Let's say 25% of the requests to a proxy server are _not_ using pipelining. Then you can save 25% of the calls to epoll_ctl() if network conditions are favourable. > Jamie, I don't force you to call read/write soon. You do if I try to optimise by deferring the call to epoll_ctl(). Let's see how my user space optimisation is affected in your description. > Your state machine will > have a state 0, from where everything starts. Let's say that this server > is an SMTP server that supports PIPELINING. When a client connects ( > you accept ) you will basically have your acceptor routine that puts the > fd for the new connection inside your list of ready-fds. Such a list will > contain connection status, state machine state and a callback at the bare > minimum. Then whenever you feel it appropriate you pop the fd from the ready > list and you call the associated callback. That callback for state 0 will have > encoded "send SMTP welcome message" to the client. The socket write buffer > will be empty and your write() will return != EAGAIN. So you keep your fd > inside your ready list. Having a ready list enables you to handle > priorities, fairness, etc... Having successfully sent the welcome string > will move you to the next state, state 1. Whenever you find it > appropriate, you'll call again the callback associated with the file > descriptor, which for state 1 will have encoded "read SMTP command". Now > suppose that the SMTP client is lazy and you have nothing in the input > buffer ( or you partially read the SMTP command ). The read() will return > EAGAIN, you remain in state 1 and you remove the fd from your ready list. At this point, I would call epoll_ctl(). 
Note, I do _not_ call epoll_ctl() after accept(), because that is a waste of time. It is better to defer it because sometimes it is not needed at all. > This guy is _ready_ to generate an event. One of the next times you'll call > epoll_wait(2) you'll find our famous fd ready to be used. Most of the time this works, but there's a race condition: after I saw EAGAIN and before I called epoll_ctl(), the state might have changed. So I must call read() after epoll_ctl(), even though it is 99.99% likely to return EAGAIN. Here is the system call sequence that I end up executing: - read() = nbytes - read() = EAGAIN - epoll_ctl() // Called first time we see EAGAIN, // if we will want to read more. - read() = EAGAIN The final read is 99.99% likely to return EAGAIN, and could be eliminated if epoll_ctl() had an option to test the condition. > Now, this one, which is a typical state machine implementation, > can be _trivially_ implemented with epoll, and I don't see how adding an > initial event might help in this design. The other, even more trivial, > implementation using coroutines shows its simplicity in a pretty clear > way. That's because it doesn't help in that design. It helps in a different (faster in some scenarios ;) design. > You don't really want to use blocking files with > an edge triggered event API. Agreed, it is rarely useful, but I felt your description ("event after EAGAIN") was technically incorrect. Of course it would be ok to _specify_ the API as only giving defined behaviour for non-blocking I/O. > > > > an event is generated on every "would-block" -> "ready" transition. > > after fd registration, you must treat the fd as "ready". > > > > The proposed rule is this: > > > > an event is generated on every "would-block" -> "ready" transition. > > after fd registration, you may treat the fd as in any state you like. > > > > The proposed rule is better because it permits better optimisations in > > user space, as explained earlier. 
(If you _really_ want to avoid the > > call to ->poll() when user space doesn't care, make that a flag > > argument to epoll_ctl()). > > I still prefer 1) Jamie, besides the system call cost ( that is not always > a cost, see soon-ready ops ), Do you avoid the cost of epoll_ctl() per new fd? > there's the fact of making the user follow a behavior > pattern, which point 2) leaves uncertain. That's the point :-) Flexibility is _good_. It means that somebody can implement a technique that you haven't thought of. With 2) the _programmer_ (let's assume some level of understanding) can use the exact application code that you offered. It will work fine. However, if they're feeling clever, like me, they can optimise further. I would suggest, though, to simply provide both options: EP_CTL_ADD and EP_CTL_ADD_AND_TEST. That's so explicit that nobody can be confused! > Show me with real code ( possibly not 25000 lines :) ) > where you get stuck w/out having the initial event and if it makes sense > and there's no clean way to solve it in user space, I'll seriously > consider your ( and John's ) proposal. Davide, I don't write buggy code deliberately. My code would not get stuck using present epoll. Neither API 1) nor 2) has a bug, but version 1) is slower with some kinds of state machine. (Don't confuse me with the person who said your API is buggy -- it is _not_ buggy, it's just not as flexible as it should be). I can write code that shows the optimisation if that would make it clearer. -- Jamie |
From: Davide L. <da...@xm...> - 2002-10-31 19:00:35
|
On Thu, 31 Oct 2002, Jamie Lokier wrote: > Davide, you have exactly explained ready lists, which I assume we'd > already agreed on (cf. beer), and completely missed the point about > deferring the call to epoll_ctl(). You haven't mentioned that at all in > your description. > > Consider a caching HTTP proxy server. Let's say 25% of the requests to > a proxy server are _not_ using pipelining. Then you can save 25% of > the calls to epoll_ctl() if network conditions are favourable. > > You do if I try to optimise by deferring the call to epoll_ctl(). > Let's see how my user space optimisation is affected in your description. > > At this point, I would call epoll_ctl(). Note, I do _not_ call > epoll_ctl() after accept(), because that is a waste of time. It is > better to defer it because sometimes it is not needed at all. > > > This guy is _ready_ to generate an event. One of the next times you'll call > > epoll_wait(2) you'll find our famous fd ready to be used. > > Most of the time this works, but there's a race condition: after I saw > EAGAIN and before I called epoll_ctl(), the state might have changed. > So I must call read() after epoll_ctl(), even though it is 99.99% > likely to return EAGAIN. > > Here is the system call sequence that I end up executing: > > - read() = nbytes > - read() = EAGAIN > - epoll_ctl() // Called first time we see EAGAIN, > // if we will want to read more. > - read() = EAGAIN > > The final read is 99.99% likely to return EAGAIN, and could be > eliminated if epoll_ctl() had an option to test the condition. > > Do you avoid the cost of epoll_ctl() per new fd? Jamie, the cost of epoll_ctl(2) is minimal/zero compared with the average life of a connection. Also it might be done once at fd "creation" and once at fd "removal". It's not inside a high-frequency loop like epoll_wait(2). Believe me, or ... do not believe me and show me a little performance data that shows this performance degradation due to a soonish epoll_ctl(2). 
And Jamie, if you really really want to use such a pattern ( delaying the fd registration, which IMHO does not help you in getting any performance boost ) you can still do it in user space ( poll() timeout 0 after epoll_ctl(2) ). Jamie, I'm _really_ willing to be contradicted with performance data here. > I would suggest, though, to simply provide both options: EP_CTL_ADD > and EP_CTL_ADD_AND_TEST. That's so explicit that nobody can be > confused! The EP_CTL_ADD_AND_TEST would do the poll() timeout 0 trick in kernel space. Is it faster done in the kernel? Sure, you can measure something at rates of 500000 registrations per second. Even here Jamie, I'm willing to be contradicted by performance data. You're trying to optimize something that is not inside a high-frequency loop, and it's not going to give you any measurable improvement IMHO. - Davide |
From: Dan K. <da...@ke...> - 2002-11-01 17:26:18
|
Davide Libenzi wrote: >>Do you avoid the cost of epoll_ctl() per new fd? > > Jamie, the cost of epoll_ctl(2) is minimal/zero compared with the average > life of a connection. Depends on the workload. Where I work, the http client I'm writing has to perform extremely well even on 1-byte files with HTTP 1.0. Minimizing system calls is surprisingly important - even a gettimeofday hurts. - Dan |
From: Davide L. <da...@xm...> - 2002-11-01 17:36:03
|
On Fri, 1 Nov 2002, Dan Kegel wrote: > Davide Libenzi wrote: > >>Do you avoid the cost of epoll_ctl() per new fd? > > > > Jamie, the cost of epoll_ctl(2) is minimal/zero compared with the average > > life of a connection. > > Depends on the workload. Where I work, the http client I'm writing > has to perform extremely well even on 1-byte files with HTTP 1.0. > Minimizing system calls is surprisingly important - even > a gettimeofday hurts. Dan, is it _one_ gettimeofday() or a gettimeofday() inside a loop ? gettimeofday() is of the order of a few microseconds ... and if your client works with anything else than a loopback, a few microseconds shouldn't weigh in much compared to connect/send/recv/close on a network connection. It is not so much the fact that you transfer one byte, it's the whole TCP handshake cost that weighs in. - Davide |
From: Dan K. <da...@ke...> - 2002-11-01 18:25:32
|
Davide Libenzi wrote: > On Fri, 1 Nov 2002, Dan Kegel wrote: > >>Davide Libenzi wrote: >> >>>>Do you avoid the cost of epoll_ctl() per new fd? >>> >>>Jamie, the cost of epoll_ctl(2) is minimal/zero compared with the average >>>life of a connection. >> >>Depends on the workload. Where I work, the http client I'm writing >>has to perform extremely well even on 1-byte files with HTTP 1.0. >>Minimizing system calls is surprisingly important - even >>a gettimeofday hurts. > > Dan, is it _one_ gettimeofday() or a gettimeofday() inside a loop ? > gettimeofday() is of the order of a few microseconds ... and if your client > works with anything else than a loopback, a few microseconds shouldn't weigh > in much compared to connect/send/recv/close on a network connection. It is > not so much the fact that you transfer one byte, it's the whole TCP handshake > cost that weighs in. The scenario is: we're doing load testing of http products, and for various reasons, we want line-rate traffic with the smallest possible message size. i.e. we want the maximum number of HTTP requests/responses per second. Hence the 1-byte payloads. A single system call on the slowish embedded processor I'm using has a surprisingly large impact on the number of http gets per second I can do. A 1% increase in speed is worth it for me! So please do try to reduce the number of syscalls needed to handle very short TCP sessions, if possible. - Dan |
From: Jamie L. <lk...@ta...> - 2002-11-01 19:17:04
|
Dan Kegel wrote: > Davide Libenzi wrote: > >>Do you avoid the cost of epoll_ctl() per new fd? > > > >Jamie, the cost of epoll_ctl(2) is minimal/zero compared with the average > >life of a connection. > > Depends on the workload. Where I work, the http client I'm writing > has to perform extremely well even on 1-byte files with HTTP 1.0. > Minimizing system calls is surprisingly important - even > a gettimeofday hurts. For this sort of thing, I would like to see an option to automatically set the non-blocking flag on accept(). To really squeeze the system calls, you could also automatically epoll-register on accept(), and for a super bonus automatically do the accept() at event delivery time. But it's getting very silly at that point. -- Jamie |
From: Charlie K. <kr...@ac...> - 2002-11-01 20:05:11
|
Jamie Lokier <lk...@ta...> writes: > For this sort of thing, I would like to see an option to automatically > set the non-blocking flag on accept(). To really squeeze the system > calls, you could also automatically epoll-register on accept(), and > for a super bonus automatically do the accept() at event delivery time. > But it's getting very silly at that point. > -- Jamie I would like to see a new kind of nonblocking flag that implies the use of epoll. So instead of giving O_NONBLOCK to fcntl(F_SETFL), you give O_NONBLOCK_EPOLL. In addition to becoming non-blocking, the socket is added to the epoll interest set. Furthermore, if the socket is a "listener" socket, all connections accepted on the socket inherit the non-blocking status and are added automatically to the same epoll interest set. It's true that this can get silly though. I'd like to do the same with other flags, like TCP_CORK. -- Buck > -- > To unsubscribe, send a message with 'unsubscribe linux-aio' in > the body to maj...@kv.... For more info on Linux AIO, > see: http://www.kvack.org/aio/ |
From: Jamie L. <lk...@ta...> - 2002-11-01 20:14:28
|
Charlie Krasic wrote: > I would like to see a new kind of nonblocking flag that implies the > use of epoll. So instead of giving O_NONBLOCK to fcntl(F_SETFL), you > give O_NONBLOCK_EPOLL. In addition to becoming non-blocking, the > socket is added to the epoll interest set. Furthermore, if the socket is > a "listener" socket, all connections accepted on the socket inherit > the non-blocking status and are added automatically to the same epoll > interest set. It's true that this can get silly though. I'd like to > do the same with other flags, like TCP_CORK. ... and close-on-exec. -- Jamie |
From: Mark M. <ma...@ma...> - 2002-11-01 20:20:35
|
On Fri, Nov 01, 2002 at 07:16:43PM +0000, Jamie Lokier wrote: > > Depends on the workload. Where I work, the http client I'm writing > > has to perform extremely well even on 1-byte files with HTTP 1.0. > > Minimizing system calls is surprisingly important - even > > a gettimeofday hurts. > For this sort of thing, I would like to see an option to automatically > set the non-blocking flag on accept(). To really squeeze the system > calls, you could also automatically epoll-register on accept(), and > for a super bonus automatically do the accept() at event delivery time. > But it's getting very silly at that point. Not really... isn't accept() automatically performed ahead of time anyways, as long as the listen queue isn't full? Another issue for the 'unified event notification model': How does epoll interact with signals, specifically the race condition between determining the timeout that should be passed to epoll_wait(), and epoll_wait() itself? (see pselect() for info) For example: it is very common for priority to be given to a fd callback before a signal callback, meaning that epoll_wait() would be called with timeout=0 if a received signal did not have its callback executed yet, or something greater, otherwise. I would like to see at least one of the following (suggestions made by other people) in the final version: 1) Userspace data pointer to allow more efficient userspace dispatching when epoll_wait() returns. (Something about scanning array structures for matching fd arguments rubs me the wrong way -- it shouldn't be necessary) 2) Reduced requirements to issue system calls such as read() when EAGAIN is the expected return value. The whole 'do a quick poll() or similar at registration time upon request' issue - for obscure cases that would require complex code, or code that cannot yet be agreed upon, this could temporarily mark events ready at registration without checking, with a goal of eliminating this behaviour one type of file at a time. 
Although the ability to wait on futex or timeout objects seems clever, I'm not sure that we are at a point where we know how they would be commonly used yet. Right now people need a poll() replacement for file descriptors. Timeouts can be handled by manipulating the argument to epoll_wait() and performing userspace analysis (same as poll()). Futex objects have not (to my knowledge) yet been used in great numbers at the same time (i.e. wait for 100 futexes to be obtained) probably because the routines necessary to perform this operation do not yet exist. It might be nice to fit this into epoll later, but it doesn't need to yet. mark -- ma...@mi... | Neighbourhood Coder | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/ |
From: Jamie L. <lk...@ta...> - 2002-10-31 15:43:30
|
ps. I thought I should explain what bothers me most about epoll at the moment. It's good at what it does, but it's so very limited in what it supports. I have a high performance server application in mind, that epoll is _almost_ perfect for but not quite. Davide, you like coroutines, so perhaps you will appreciate a web server that serves a mixture of dynamic and static content, using coroutines and user+kernel threading in a carefully balanced way. Dynamic content is cached, accurately (taking advantage of nanosecond mtimes if possible), yet served as fast as static pages (using a clever cache validation method), and is built from files (read using aio to improve throughput) and subrequests to other servers just like a proxy. Data is served zero-copy using sendfile and /dev/shm. A top quality server like that, optimised for performance, has to respond to these events:

- network accept()
- read/write/exception on sockets and pipes
- timers
- aio
- futexes
- dnotify events

See how epoll only helps with the first two? And this is the very application space that epoll could _almost_ be perfect for. Btw, it doesn't _have_ to be a web server. Enterprise scale Java runtimes, database servers, spider clients, network load generators, proxies, even humble X servers - also have very similar requirements. There are several scalable and fast event queuing mechanisms in the kernel now: rt-signals, aio and epoll, yet each of them is limited by only keeping track of a few kinds of possible event. Technically, it's possible to use them all together. If you want to react to all the kinds of events I listed above, you have to. But it's mighty ugly code to use them all at once, and it's certainly not the "lean and mean" event loop that everyone aspires to. By adding yet another mechanism without solving the general problem, epoll just makes the mighty ugly userspace more ugly. (But it's probably worth using - socket notification through rt-signals has its own problems). 
I would very much like to see a general solution to the problem of all different kinds of events being queued to userspace efficiently, through one mechanism ("to bind them all..."). Every piece of this puzzle has been written already, they're just not joined up very well. I'm giving this serious thought now, if anyone wants to offer input. -- Jamie ps. Alan, you mentioned something about futexes being suitable. Was that a vague notion, or do you have a clear idea in mind? (A nice way to collect events from a _set_ of futexes might be just the thing.) |
From: Alan C. <al...@lx...> - 2002-10-31 16:26:15
|
On Thu, 2002-10-31 at 15:41, Jamie Lokier wrote: > - network accept() > - read/write/exception on sockets and pipes > - timers > - aio > - futexes > - dnotify events > > See how epoll only helps with the first two? And this is the very > application space that epoll could _almost_ be perfect for. > > ps. Alan, you mentioned something about futexes being suitable. > Was that a vague notion, or do you have a clear idea in mind? > > (A nice way to collect events from a _set_ of futexes might be just the thing.) The futexes do all the high performance stuff you actually need. One way to do it is to do user space signal delivery setting futexes off, but that means user space switches and is just wrong. Setting a list of futexes instead of signal delivery in kernel space is fast. Letting the user pick which futexes get set allows you to do neat stuff like trees of wakeup without having to handle it kernel side. What is hard is multiple futex waits and livelock for that. I think it can be done properly but I've not sat down and designed it all out - I wonder what Rusty thinks. Alan |
From: Davide L. <da...@xm...> - 2002-10-31 20:18:43
|
On Thu, 31 Oct 2002, Jamie Lokier wrote: > ps. I thought I should explain what bothers me most about epoll at the > moment. It's good at what it does, but it's so very limited in what > it supports. > > I have a high performance server application in mind, that epoll is > _almost_ perfect for but not quite. > > Davide, you like coroutines, so perhaps you will appreciate a web > server that serves a mixture of dynamic and static content, using > coroutines and user+kernel threading in a carefully balanced way. > Dynamic content is cached, accurately (taking advantage of nanosecond > mtimes if possible), yet served as fast as static pages (using a > clever cache validation method), and is built from files (read using > aio to improve throughput) and subrequests to other servers just like > a proxy. Data is served zero-copy using sendfile and /dev/shm. > > A top quality server like that, optimised for performance, has to > respond to these events: > > - network accept() > - read/write/exception on sockets and pipes > - timers > - aio > - futexes > - dnotify events > > See how epoll only helps with the first two? And this is the very > application space that epoll could _almost_ be perfect for. > > Btw, it doesn't _have_ to be a web server. Enterprise scale Java > runtimes, database servers, spider clients, network load generators, > proxies, even humble X servers - also have very similar requirements. > > There are several scalable and fast event queuing mechanisms in the > kernel now: rt-signals, aio and epoll, yet each of them is limited by > only keeping track of a few kinds of possible event. > > Technically, it's possible to use them all together. If you want to > react to all the kinds of events I listed above, you have to. But > it's mighty ugly code to use them all at once, and it's certainly not > the "lean and mean" event loop that everyone aspires to. 
> > By adding yet another mechanism without solving the general problem, > epoll just makes the mighty ugly userspace more ugly. (But it's > probably worth using - socket notification through rt-signals has its > own problems). > > I would very much like to see a general solution to the problem of all > different kinds of events being queued to userspace efficiently, > through one mechanism ("to bind them all..."). Every piece of this puzzle > has been written already, they're just not joined up very well. > > I'm giving this serious thought now, if anyone wants to offer input. Jamie, the fact that epoll supports a limited number of "objects" was as-designed at that time. I see it as quite easy to extend it to support other objects. Futexes are a matter of one line of code:

/* Waiter either waiting in FUTEX_WAIT or poll(), or expecting signal */
static inline void tell_waiter(struct futex_q *q)
{
	wake_up_all(&q->waiters);
	if (q->filp) {
		send_sigio(&q->filp->f_owner, q->fd, POLL_IN);
+		file_notify_send(q->filp, ION_IN, POLLIN | POLLRDNORM);
	}
}

Timers, as long as you access them through a file* interface ( like futexes ), will become trivial too. Another line should be sufficient for dnotify :

void __inode_dir_notify(struct inode *inode, unsigned long event)
{
	struct dnotify_struct *dn;
	struct dnotify_struct **prev;
	struct fown_struct *fown;
	int changed = 0;

	write_lock(&dn_lock);
	prev = &inode->i_dnotify;
	while ((dn = *prev) != NULL) {
		if ((dn->dn_mask & event) == 0) {
			prev = &dn->dn_next;
			continue;
		}
		fown = &dn->dn_filp->f_owner;
		send_sigio(fown, dn->dn_fd, POLL_MSG);
+		file_notify_send(dn->dn_filp, ION_IN, POLLIN | POLLRDNORM | POLLMSG);
		if (dn->dn_mask & DN_MULTISHOT)
			prev = &dn->dn_next;
		else {
			*prev = dn->dn_next;
			changed = 1;
			kmem_cache_free(dn_cache, dn);
		}
	}
	if (changed)
		redo_inode_mask(inode);
	write_unlock(&dn_lock);
}

This is the result of a quick analysis, but I do not expect it to be much more difficult than that. - Davide |