From: Jamie L. <lk...@ta...> - 2002-10-31 23:02:32
|
Davide, I think you are right. That's why I said epoll was _nearly_ perfect :)

Davide Libenzi wrote:
> Jamie, the fact that epoll supports a limited number of "objects" was an as-designed at that time. I see it quite easy to extend it to support other objects. Futexes are a matter of one line of code int :

Agreed - though I'd prefer if the overhead of creating a temporary fd for each futex were eliminated, as well as the potentially large fd table. (In a threaded app, it's reasonable to have many more futexes than sockets, and they are created and destroyed rapidly too. No data on how many of those futexes need to be registered, though.)

In other words, add another op to sys_futex() called FUTEX_EPOLL which directly registers the futex on an epoll interest list, and let epoll report those events as futex events. (I suspect that is quite easy too.)

> Timer, as long as you access them through a file* interface ( like futexes ) will become trivial too. Another line should be sufficient for dnotify :

Sorry (<humble/>), ignore timers. Somehow I picked up the idea that epoll_wait() didn't have a timeout from some example or other, which was very silly of me. I've read the patch properly now! Of course epoll supports timers - a timeout is quite enough for user space.

> void __inode_dir_notify(struct inode *inode, unsigned long event)

Agreed. This is looking good :) It's lucky that polling for readability on a directory is not useful in any other way, though :)

The semantics for this are a bit confusing and inconsistent with poll(). The user gets a POLL_RDNORM event which means something in the directory has changed, not that the directory is now readable or that poll() would return POLL_RDNORM. It really should be a different flag, made for the purpose.

> This is the result of a quite quick analysis, but I do not expect it to be much more difficult than that.

Someone suggested hooking into ->poll() as a less obtrusive way to implement epoll. You're probably right that it's quicker to hook directly as you have done, although it would be great if epoll could somehow "fall back" to using ->poll() in the cases where epoll isn't directly supported by a file object.

I wrote quite a lot about futexes above. That's because good futex support, and fallback to ->poll(), would pretty much make epoll universal. What do you think of these ideas?

1. Add a FUTEX_EPOLL operation to futex.c, which registers a futex with an epoll interest list. This would cause FUTEX_WAKE calls on that address to generate epoll events. Some care is needed here to keep track of the exact number of events generated, because some rather subtle usages of futex depend on the return value from futex_wake being the _exact_ number of waiters that are woken. It would have to correspond to the exact number of events counted by userspace.

2. Add a check to EP_CTL_ADD which checks whether a file supports epoll notifications natively. Perhaps a file_operations hook is in order here. If it does, great. If not, fall back to a generic mechanism that uses the file's ->poll() method. (I haven't thought through for sure how plausible this is.) Magically, every kind of fd works, including special devices, and the things that are most performance critical (sockets, pipes, futexes) are tuned. Yum!

3. Eliminate send_sigio() calls - pass all events to epoll, and let epoll dispatch signals where they have been registered. In combination with (2), this magically provides SIGIO support for all fd types as well.

4. Merge the aio and epoll event reporting functions: io_getevents and epoll_wait are remarkably similar, and should really be one function. It would introduce binary incompatibility somewhere, though.

There are a few cherries to go on top but I don't want to make this email any longer. Those are the essentials :)

-- Jamie

ps. Falling back to ->poll() also means you can make trees of notifications, like Alan suggested :)

pps. I much prefer epoll's use of an fd for the interest list to aio's aio_context_t. |
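A minimal sketch of how the FUTEX_EPOLL registration proposed above might look from user space. This is purely hypothetical: FUTEX_EPOLL was only a proposal in this thread, never a kernel API, so the op number and argument layout are invented for illustration.

#include <sys/syscall.h>
#include <unistd.h>

#define FUTEX_EPOLL 5   /* hypothetical op number - never existed */

/* Register 'uaddr' on the epoll interest list behind 'epfd', so a
 * FUTEX_WAKE on uaddr is reported by epoll_wait() as a futex event,
 * with no temporary fd created per futex. */
static int futex_epoll_register(int *uaddr, int epfd)
{
        /* sys_futex(uaddr, op, val, utime): epfd rides in 'val' */
        return syscall(SYS_futex, uaddr, FUTEX_EPOLL, epfd, NULL);
}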
From: Davide L. <da...@xm...> - 2002-11-01 00:51:51
|
On Thu, 31 Oct 2002, Jamie Lokier wrote:
> Davide, I think you are right. That's why I said epoll was _nearly_ perfect :)
>
> Davide Libenzi wrote:
> > Jamie, the fact that epoll supports a limited number of "objects" was an as-designed at that time. I see it quite easy to extend it to support other objects. Futexes are a matter of one line of code int :
>
> Agreed - though I'd prefer if the overhead of creating a temporary fd for each futex were eliminated, as well as the potentially large fd table. (In a threaded app, it's reasonable to have many more futexes than sockets, and they are created and destroyed rapidly too. No data on how many of those futexes need to be registered, though).
>
> In other words, add another op to sys_futex() called FUTEX_EPOLL which directly registers the futex on an epoll interest list, and let epoll report those events as futex events.

Jamie, the futex support can be easily done with a one-line code patch. I still prefer the one-to-one mapping between futexes and files. It makes everything easier. I don't really see futex creation/destroy as a high-frequency event that might be suitable for optimization. Usually you have your own set of resources to be "protected" and in 95% of cases you know those resources from the beginning.

> > Timer, as long as you access them through a file* interface ( like futexes ) will become trivial too. Another line should be sufficient for dnotify :
>
> Sorry (<humble/>), ignore timers. Somehow I picked up the idea that epoll_wait() didn't have a timeout from some example or other, which was very silly of me. I've read the patch properly now! Of course epoll supports timers - a timeout is quite enough for user space.

If you want to timeout I/O operations you can easily put a timer routine in your main event scheduler loop. But I still like the idea of timers easily accessible through a file* interface.

> > void __inode_dir_notify(struct inode *inode, unsigned long event)
>
> Agreed. This is looking good :)

I asked Linus what he thinks about this one-line patch.

> Someone suggested hooking into ->poll() as a less obtrusive way to implement epoll. You're probably right that it's quicker to hook directly as you have done, although it would be great if epoll could somehow "fall back" to using ->poll() in the cases where epoll isn't directly supported by a file object.

I'm currently investigating this ... looks like an easy way to support "alien" files :)

> I wrote quite a lot about futexes above. That's because good futex support, and fallback to ->poll() would pretty much make epoll universal. What do you think of these ideas?:
>
> 1. Add FUTEX_EPOLL operation to futex.c, which registers a futex with an epoll interest list. This would cause FUTEX_WAKE calls on that address to generate epoll events. Some care is needed here to keep track of the exact number of events generated, because some rather subtle usages of futex depend on the return value from futex_wake being the _exact_ number of waiters that are woken. It would have to correspond to the exact number of events counted by userspace.

I still believe that the 1:1 mapping is sufficient and with that in place ( and the one line patch to kernel/futex.c ) futex support comes nicely.

> 2. Add a check to EP_CTL_ADD which checks whether a file supports epoll notifications natively. Perhaps a file_operations hook is in order here. If it does, great. If not, fall back to a generic mechanism that uses the file's ->poll() method. (I haven't thought through for sure how plausible this is). Magically, every kind of fd works, including special devices, and the things that are most performance critical (sockets, pipes, futexes) are tuned. Yum!

Yes, kind of. The hook for an efficient edge-triggered event notification should be something like the socket one, where you have a ->data_ready() and ->write_space(), and where the caller of these callbacks knows that signals have to be delivered on 0->1 transitions. With the poll hook you have the drawback that the wakeup list is invoked each time data arrives, and this might generate a little bit too many events. This is not a problem since epoll collapses them, but collapsing still costs CPU cycles.

> 3. Eliminate send_sigio() calls - pass all events to epoll, and let epoll dispatch signals where they have been registered. In combination with (2), this magically provides SIGIO support for all fd types as well.

I would leave that as a next cleanup operation, eventually.

- Davide |
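To make the distinction Davide draws concrete, here is a minimal sketch of a 0->1 edge-triggered hook. The names and types are invented for illustration; this is not the actual socket callback code.

struct ep_source {
        int queued;                             /* items currently pending */
        void (*notify)(struct ep_source *);     /* deliver one epoll event */
};

static void source_data_ready(struct ep_source *s, int count)
{
        int was_empty = (s->queued == 0);

        s->queued += count;
        if (was_empty)
                s->notify(s);   /* exactly one event per 0->1 transition */
        /* A plain ->poll() hook would instead run the wakeup list on
         * every arrival, leaving epoll to collapse the duplicates. */
}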
From: Jamie L. <lk...@ta...> - 2002-11-01 02:01:53
|
Davide Libenzi wrote:
> Jamie, the futex support can be easily done with one line of code patch. I still prefer the one-to-one mapping between futexes and files. It makes everything easier.

I do agree it is very simple and hence good.

> I don't really see futex creation/destroy as an high frequency event that might be suitable for optimization. Usually you have your own set of resources to be "protected" and in 95% of cases you know those resources from the beginning.

Well, I'll disagree but stay mostly quiet. I think it is reasonable to have a futex per _object_ in certain language run-times. Allocation rate: 10,000,000 per second in some examples (e.g. certain kinds of threaded simulator). Hardly any of those will need associated fds, and I have no figures on how many or how often, but you can see that futexes are sometimes used in a very dynamic way because they are so cheap until contention.

That's the cool thing about futexes: there's absolutely zero kernel overhead until contention, and only one "long" of overhead in user space. At contention, two syscalls resolve it synchronously: futex_wait, futex_wake. The async method using an fd with epoll takes five: futex_fd, epoll_ctl, poll, futex_wake, futex_close. That works, but lacks the _cool_ factor that futexes have IMHO. It should be: futex_wait_async, futex_wake. I realise my argument is a weak one though :)

> > > Timer, as long as you access them through a file* interface ( like futexes ) will become trivial too. Another line should be sufficent for dnotify :
> >
> > Sorry (<humble/>), ignore timers. Somehow I picked up the idea that epoll_wait() didn't have a timeout from some example or other, which was very silly of me. I've read the patch properly now! Of course epoll supports timers - a timeout is quite enough for user space.
>
> If you want to timeout I/O operations you can easily put a timer routine in your main event scheduler loop. But I still like the idea of timers easily accessible through a file* interface.

Sure, but using the file* interface implies entering the kernel - that can sometimes be skipped* if your timer queue is in user space.

* - it happens under heavy load, conveniently.

> > > void __inode_dir_notify(struct inode *inode, unsigned long event)
> >
> > Agreed. This is looking good :)
>
> I asked Linus what he thinks about this one-line patch.

I have no objections to it. Generally, I'd like epoll to be able to report _what_ the event was (not just POLL_RDNORM, but what kind of dnotify event), but as I don't get to run on an ideal kernel [;)] I'll be happy with POLL_RDNORM.

> I still believe that the 1:1 mapping is sufficient and with that in place ( and the one line patch to kernel/futex.c ) futex support comes nicely.

It does work - actually, with ->poll() you don't need any lines in futex.c :) Even if a specialised futex hook is added someday, the fd support will continue to be useful.

> > 2. Add a check to EP_CTL_ADD which checks whether a file supports epoll notifications natively. Perhaps a file_operations hook is in order here. If it does, great. If not, fall back to a generic mechanism that uses the file's ->poll() method. (I haven't thought through for sure how plausible this is). Magically, every kind of fd works, including special devices, and the things that are most performance critical (sockets, pipes, futexes) are tuned. Yum!
>
> Yes, kind of. The hook for an efficient edge-triggered event notification should be something like the socket one where you have a ->data_ready() and ->write_space(), where the caller of these callbacks knows that signals have to be delivered on 0->1 transitions. With the poll hook you have the drawback that the wakeup list is invoked each time data arrives and this might generate a little bit too many events. This is not a problem since epoll collapses them, but collapsing still costs CPU cycles.

You avoid the extra CPU cycles like this:

1. EP_CTL_ADD adds the listener to the file's wait queue using ->poll(), and gets a free test of the object readiness [;)]

2. When the transition happens, the wakeup will call your function, epoll_wakeup_function. That removes the listener from the file's wait queue. Note, you won't see any more wakeups from that file.

3. When you report the event to user space, _then_ you automatically add the listener back to the file's wait queue by calling ->poll().

This way, there are no spurious wakeups, and nothing to collapse. I would not be surprised if this is quite fast - perhaps as fast as the special epoll hooks.

The nice feature that makes this possible is that waitqueues don't wake up tasks any more: they simply call your choice of callback function. It was changed for aio, and it's a good change.

-- Jamie |
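A rough kernel-side sketch of steps 1-3 above, written against the 2.5-era callback wait queues Jamie mentions. Signatures are approximated and queue_event() is an invented helper; this is an illustration, not the epoll patch itself.

struct dpi {
        struct file *file;              /* the watched file */
        wait_queue_head_t *whead;       /* its poll wait queue head */
        wait_queue_t wait;              /* our callback entry in it */
};

/* Step 2: the file's wakeup calls this instead of waking a task. */
static int epoll_wakeup_function(wait_queue_t *wait, unsigned mode, int sync)
{
        struct dpi *dpi = container_of(wait, struct dpi, wait);

        /* Unhook immediately: no further wakeups arrive from this
         * file until the event has been reported to user space. */
        remove_wait_queue(dpi->whead, &dpi->wait);
        queue_event(dpi);       /* invented: stash for epoll_wait() */
        return 1;
}

/* Step 3: after reporting, re-arm by going through ->poll() again,
 * which re-adds dpi->wait and gives a free readiness re-test. */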
From: Davide L. <da...@xm...> - 2002-11-01 17:27:14
|
On Fri, 1 Nov 2002, Jamie Lokier wrote:
> I have no objections to it. Generally, I'd like epoll to be able to report _what_ the event was (not just POLL_RDNORM, but what kind of dnotify event), but as I don't get to run on an ideal kernel [;)] I'll be happy with POLL_RDNORM.

See below ...

> > I still believe that the 1:1 mapping is sufficient and with that in place ( and the one line patch to kernel/futex.c ) futex support comes nicely.
>
> It does work - actually, with ->poll() you don't need any lines in futex.c :)

The global poll hook might work, but you can't deliver anything with your callback. And you have to actually do another poll to understand which poll flags you really got. Adding the famous one-line patches to single devices any time there's the need, file_notify_send() being completely expandable, enables you to send a more detailed report back to the caller.

> You avoid the extra CPU cycles like this:
>
> 1. EP_CTL_ADD adds the listener to the file's wait queue using ->poll(), and gets a free test of the object readiness [;)]
>
> 2. When the transition happens, the wakeup will call your function, epoll_wakeup_function. That removes the listener from the file's wait queue. Note, you won't see any more wakeups from that file.
>
> 3. When you report the event to user space, _then_ you automatically add the listener back to the file's wait queue by calling ->poll().
>
> This way, there are no spurious wakeups, and nothing to collapse. I would not be surprised if this is quite fast - perhaps as fast as the special epoll hooks.

Jamie, I'm afraid it won't. This is the cost of reporting events to the user with the current epoll :

        ep->eventcnt = 0;
        ++ep->ver;
        if (ep->pages == ep->pages0) {
                ep->pages = ep->pages1;
                dvp->ep_resoff = 0;
        } else {
                ep->pages = ep->pages0;
                dvp->ep_resoff = ep->numpages * PAGE_SIZE;
        }

Using the global poll hook you have several problems. First, the poll table is not suitable for single insert/removal; all you need is a poll_wait() that, recognizing a special type of table, does the insert in the wait queue differently. For example you'll have :

typedef struct poll_table_struct {
+       int queue;
+       wait_queue_t *q;
        int error;
        struct poll_table_page * table;
} poll_table;

This together with :

static inline void poll_initwait_ex(poll_table* pt, wait_queue_t *q, int queue)
{
+       pt->queue = queue;
+       pt->q = q;
        pt->error = 0;
        pt->table = NULL;
}

And the :

void __pollwait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
{
        struct poll_table_page *table = p->table;

+       if (!p->queue)
+               return;
+       if (p->q) {
+               add_wait_queue(wait_address, p->q);
+               return;
+       }
        if (!table || POLL_TABLE_FULL(table)) {
                struct poll_table_page *new_table;

                new_table = (struct poll_table_page *) __get_free_page(GFP_KERNEL);
                if (!new_table) {
                        p->error = -ENOMEM;
                        __set_current_state(TASK_RUNNING);
                        return;
                }
                new_table->entry = new_table->entries;
                new_table->next = table;
                p->table = new_table;
                table = new_table;
        }

        /* Add a new entry */
        {
                struct poll_table_entry * entry = table->entry;
                table->entry = entry+1;
                get_file(filp);
                entry->filp = filp;
                entry->wait_address = wait_address;
                init_waitqueue_entry(&entry->wait, current);
                add_wait_queue(wait_address,&entry->wait);
        }
}

This enables you to do two things :

1) During the EP_CTL_ADD you do :

        poll_table pt;

        poll_initwait_ex(&pt, &dpi->q, 1);
        file->f_op->poll(file, &pt);

and this adds _your_own_ wait queue object to the file's poll queue. No full-blown poll_table. You need your own wait queue because when the callback ( wakeup ) is called you need to call container_of() to get the dpi*

2) Before reporting events you need to fetch the _real_ poll flags for each file you received the callback from. You do :

        poll_table pt;

        poll_initwait_ex(&pt, NULL, 0);
        flags = file->f_op->poll(file, &pt);

You really don't want to remove and add to the file's poll queue _every_ time you receive an event. You're going to pay a lot for that. I'm currently coding this one to give it a try and see what kind of performance I get. With the global poll hook you won't be able to do the more detailed event reporting that file_notify_event() enables you to do.

- Davide |
From: John G. M. <jg...@ne...> - 2002-11-02 09:33:59
|
Jamie Lokier wrote:
> You avoid the extra CPU cycles like this:
>
> 1. EP_CTL_ADD adds the listener to the file's wait queue using ->poll(), and gets a free test of the object readiness [;)]
>
> 2. When the transition happens, the wakeup will call your function, epoll_wakeup_function. That removes the listener from the file's wait queue. Note, you won't see any more wakeups from that file.
>
> 3. When you report the event to user space, _then_ you automatically add the listener back to the file's wait queue by calling ->poll().

The cost of removing and re-adding the listener to the file's wait queue is part of what epoll is amortizing.

There's also the oddity that I noticed this week: pipes don't report POLLOUT readiness through the classic poll interface until the pipe's buffer is completely empty. Changing this to report POLLOUT readiness when the pipe's buffer is not full apparently causes NIS to break. |
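A small userspace program makes the observed behaviour concrete (a sketch; the pipe buffer size and the exact threshold depend on the kernel):

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        int p[2];
        char buf[4096];
        struct pollfd pfd;

        memset(buf, 0, sizeof(buf));
        if (pipe(p) < 0)
                return 1;
        fcntl(p[1], F_SETFL, O_NONBLOCK);

        /* Fill the pipe until write() would block. */
        while (write(p[1], buf, sizeof(buf)) > 0)
                ;

        read(p[0], buf, 1);     /* drain a single byte */

        pfd.fd = p[1];
        pfd.events = POLLOUT;
        poll(&pfd, 1, 0);

        /* On the kernels discussed here this prints "no": POLLOUT is
         * not reported until the buffer is completely empty. */
        printf("POLLOUT after draining one byte: %s\n",
               (pfd.revents & POLLOUT) ? "yes" : "no");
        return 0;
}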
From: Mark M. <ma...@ma...> - 2002-11-02 04:53:07
|
On Fri, Nov 01, 2002 at 03:27:41PM -0800, John Gardiner Myers wrote:
> There's also the oddity that I noticed this week: pipes don't report POLLOUT readiness through the classic poll interface until the pipe's buffer is completely empty. Changing this to report POLLOUT readiness when the pipe's buffer is not full apparently causes NIS to break.

This seems deficient. Does this mean that pipes managed via poll() are not able to reach maximum throughput?

mark

-- 
ma...@mi.../ma...@nc.../ma...@no... | Neighbourhood Coder | Ottawa, Ontario, Canada
One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...
http://mark.mielke.cc/ |
From: John G. M. <jg...@ne...> - 2002-11-05 18:15:35
|
Mark Mielke wrote:
> On Fri, Nov 01, 2002 at 03:27:41PM -0800, John Gardiner Myers wrote:
> > There's also the oddity that I noticed this week: pipes don't report POLLOUT readiness through the classic poll interface until the pipe's buffer is completely empty. Changing this to report POLLOUT readiness when the pipe's buffer is not full apparently causes NIS to break.
>
> This seems deficient. Does this mean that pipes managed via poll() are not able to reach maximum throughput?

I could see this going either way, depending on the application. Holding off the POLLOUT readiness could improve performance by making sure that whenever a process is scheduled to write to a pipe, the pipe has enough buffer to take all of the data. |
From: Benjamin L. <bc...@re...> - 2002-11-05 18:18:45
|
On Tue, Nov 05, 2002 at 10:15:33AM -0800, John Gardiner Myers wrote:
> I could see this going either way, depending on the application. Holding off the POLLOUT readiness could improve performance by making sure that whenever a process is scheduled to write to a pipe the pipe has enough buffer to take all of the data.

Aio write to pipes has a distinct advantage here, as the pipe code can provide the write atomicity guarantees while preserving the non-blocking aspect of the io submission interface.

-ben

-- 
"Do you seek knowledge in time travel?" |
From: Jamie L. <lk...@ta...> - 2002-11-02 15:41:42
|
John Gardiner Myers wrote:
> The cost of removing and re-adding the listener to the file's wait queue is part of what epoll is amortizing.

Not really. The main point of epoll is to ensure O(1) processing time per event - one list add and removal doesn't affect that. It has a constant time overhead, which I expect is rather small - but Davide says he's measuring that so we'll see.

> There's also the oddity that I noticed this week: pipes don't report POLLOUT readiness through the classic poll interface until the pipe's buffer is completely empty. Changing this to report POLLOUT readiness when the pipe's buffer is not full apparently causes NIS to break.

There's a section in the Glibc manual which talks about pipe atomicity. A pipe must guarantee that a write of PIPE_BUF bytes or less either blocks or is accepted whole. So you can't report POLLOUT just because there is room in the pipe - there must be PIPE_BUF room.

Furthermore, the manual says that after writing PIPE_BUF bytes, further writes will block until some bytes are read. The latter does not seem a useful requirement to me - I think that a pipe could be larger than the PIPE_BUF atomicity value, but perhaps it is defined in POSIX or SUS to be like this. (Someone care to check?)

Together these would seem to imply the behaviour noted by John.

-- Jamie |
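For concreteness, the weaker rule Jamie's first point implies would look something like this (an illustrative sketch, not the actual 2.4 pipe_poll() code):

#include <limits.h>     /* PIPE_BUF */
#include <poll.h>       /* POLLOUT */

/* POLLOUT should require at least PIPE_BUF bytes free, so a write of
 * <= PIPE_BUF bytes - which must be accepted whole or block - cannot
 * block right after poll() reported writability. */
static unsigned int pipe_writable_mask(unsigned int buffer_size,
                                       unsigned int bytes_queued)
{
        unsigned int free_space = buffer_size - bytes_queued;

        /* Reporting POLLOUT whenever free_space > 0 would let poll()
         * claim writability that an atomic write can't actually use. */
        return free_space >= PIPE_BUF ? POLLOUT : 0;
}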
From: Jamie L. <lk...@ta...> - 2002-11-01 20:46:05
|
Davide Libenzi wrote:
> > In other words, add another op to sys_futex() called FUTEX_EPOLL which directly registers the futex on an epoll interest list, and let epoll report those events as futex events.
>
> Jamie, the futex support can be easily done with one line of code patch. I still prefer the one-to-one mapping between futexes and files.

I forgot something important: futex notifications must be _exactly counted_ for some uses of futexes. It's all very subtle, but there's an example in Rusty's futex library where a token is passed to one of the waiters, and waiters are queued up behind each other in the order they started waiting. (See futex_up_fair() in usersem.h). You need this to prevent starvation, with Alan's example of waiting for multiple futexes being a particularly nasty case.

Because of this, and the way your one-liner works, I think* that a multi-threaded program will need to allocate one fd per waiter to guarantee the count - not one fd per waited-upon futex. So when 1000 threads are waiting on some global mutex (as happens), they'll need an fd each - they can't share one.

* - If I'm wrong about this, please someone correct me.

Consequently, fds will need to be allocated when a thread wants to wait, instead of lazily once per contended futex - hence a higher rate of allocations and deallocations.

The fixes for this are twofold:

1. You must change file_send_notify() so that it takes a count which limits the number of notifications (like FUTEX_WAKE), and returns the number of notifications sent.

2. The futex's queue of waiters must contain the epoll waiters _and_ waitqueue waiters, in the order that they started waiting. It's not enough to wake the epoll waiters first, and if any notifications are left, wake the others, nor vice versa.

Futex epolls are a bit fiddly.

-- Jamie |
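To see why the count must be exact, consider a fair unlock in the style Jamie references. This is an invented illustration, not the actual usersem.h code; the syscall wrapper and the simplified release are assumptions of the sketch.

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex_wake_n(int *uaddr, int n)
{
        /* returns the number of waiters actually woken */
        return syscall(SYS_futex, uaddr, FUTEX_WAKE, n, NULL);
}

static void fair_up(int *uaddr)
{
        /* Try to hand ownership directly to exactly one queued waiter. */
        if (futex_wake_n(uaddr, 1) == 1)
                return;         /* token handed over; futex stays "taken" */

        /* Nobody was woken, so really release the futex. If the kernel
         * over- or under-counted the wakeup, a token would be duplicated
         * or lost - hence the exact-count requirement. */
        *uaddr = 1;             /* simplified; real code uses atomic ops */
}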
From: Matthew D. H. <mh...@fr...> - 2002-11-01 01:55:54
|
If I may respectfully weigh in...

If a new API and/or a significant change in semantics is to be applied to the kernel for a unified event notification system, this is obviously an issue for 2.7 or 2.9. As such, we have plenty of time to focus upon simplicity and correctness rather than plain old inertia. We need to bring a truly unified, and therefore new, event API to the kernel, and it must be done right. kevent attempts to achieve this for FreeBSD, and generally speaking, it succeeds. But linux can do much better.

The API should present the notion of a general edge-triggered event (e.g. I/O upon sockets, pipes, files, timers, etc.), and it should do so simply. Linus made some suggestions on lkml back in 2000 (http://marc.theaimsgroup.com/?l=linux-kernel&m=97236943118139&w=2) that pretty much hit the nail on the head -- with some exceptions.

* Unless every conceivable event is to be represented as a file (rather unintuitive IMHO), his proposed interface fails to accommodate non-I/O events (e.g. timers, signals, directory updates, etc.). As much as I appreciate the UNIX Way, making everything a file is a massive oversimplification. We can often stretch the definition far enough to make this work, but I'd be impressed to see how one intends to call, e.g., a timer a type of file.

* There is a seemingly significant overhead in performing exactly one callback per event. Doesn't this prevent any kind of event coalescence? It seems like we could be doing an awful lot of cache thrashing, among other things. In some cases, this might happen anyway, but why should the interface enforce such behavior? In most other cases, it's perfectly acceptable to inline an event handler (either via compile-time inlining or literally). I do realize that Linus only suggested that the C library do the mass callbacks, BTW, so strictly speaking, it's the userland API that would "enforce such behavior." Nonetheless, this is of concern.

Enough of what we shouldn't do. Here's what we should:

* The interface should allow the implementation to be highly extensible without horrible code contortions within the kernel. It is important to be able to add new types of events as they become necessary. The interface should be general and simple enough to accommodate these extensions. Linux (really, UNIX) has failed to exercise this foresight in the past, and that's why we have the current mess of varied polling/triggering methods.

* I might be getting a bit utopian here, but IMHO the kernel should move toward a completely edge-triggered event system. The old level-triggered interfaces should only wrap this paradigm.

* Might as well reiterate: simplicity. FreeBSD's kevent solves nearly all of the traditional problems, but it is gross. And complicated. There were clearly too many computer scientists and not enough engineers on that team.

* Only one queue per process or kernel thread. Multiple queues per flow of execution is just ugly and ultimately pointless. That is not to say that you can't multithread, but each thread simply uses the same queue. In cases when you want one thread to only wait on a certain type(s) of event, you can do this as well; you just can't have one thread juggling more than one queue. Since the event notification is edge-triggered, I cannot see any reason for a significant performance degradation resulting from this policy. I am not altogether convinced that this must be a strict policy, however; the issue of different userspace threads having different event queues inside the same task is interesting.

* No re-arming events. They must be manually killed.

* I'm sure everyone would agree that passing an opaque "user context" pointer is necessary with each event.

* Zero-copy event delivery (of course).

Some question marks:

- Should the kernel attempt to prune the queue of "cancelled" events (hints later deemed irrelevant, untrue, or obsolete by newer events)?

- Is one-queue-per-task really the way to go? This stops many bad practices but could prevent some decent ones (see user threads comment).

Matthew D. Hall |
From: Davide L. <da...@xm...> - 2002-11-01 02:44:27
|
On Thu, 31 Oct 2002, Matthew D. Hall wrote:
> * Unless every conceivable event is to be represented as a file (rather unintuitive IMHO), his proposed interface fails to accommodate non-I/O events (e.g. timers, signals, directory updates, etc.). As much as I appreciate the UNIX Way, making everything a file is a massive oversimplification. We can often stretch the definition far enough to make this work, but I'd be impressed to see how one intends to call, e.g., a timer a type of file.

The fact that a timer is a file guarantees you the usage of an existing infrastructure and existing APIs to use it. For example epoll_create(2) returns you a file descriptor, and this enables you ( for example ) to drop this file descriptor inside a poll set. Also you get the cleanup infrastructure that otherwise you would have to code every time, for each new object that you create, by yourself. Something like :

        int timer_create(void);
        int timer_set(struct timespec *ts);

and you can use epoll or poll to get the timer event, and close(2) to destroy it. You get automatic compatibility with lots of nice stuff by having an object that is actually a file, and I usually like it as an idea.

> * I'm sure everyone would agree that passing an opaque "user context" pointer is necessary with each event.

I asked this about a week ago. It's _trivial_ to do in epoll. I did not receive any feedback, so I didn't implement it. Feedback will be very much appreciated here ...

- Davide |
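A sketch of how the proposed timer-as-file API might be used. Everything here is hypothetical: timer_create()/timer_set() are only the proposal above, and the epoll_ctl() argument layout is approximated from the 2002-era interface discussed in this thread.

int timer_create(void);                 /* proposed: returns an fd */
int timer_set(struct timespec *ts);     /* proposed: arms the timer */

void timer_example(void)
{
        struct timespec ts = { 1, 0 };  /* fire in one second */
        int tfd, epfd;

        tfd = timer_create();
        timer_set(&ts);

        epfd = epoll_create(10);
        epoll_ctl(epfd, EP_CTL_ADD, tfd, POLLIN);
        /* epoll_wait() now reports timer expiry like any fd event,
         * and close(tfd) destroys the timer like any other file. */
}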
From: Dan K. <da...@ke...> - 2002-11-01 18:02:51
|
Davide Libenzi wrote:
> > * I'm sure everyone would agree that passing an opaque "user context" pointer is necessary with each event.
>
> I asked this about a week ago. It's _trivial_ to do in epoll. I did not receive any feedback, so I didn't implement it. Feedback will be very much appreciated here ...

If it's cheap, do it! It relieves the programmer of having to manage an fd-to-object lookup table.

- Dan |
From: Jamie L. <lk...@ta...> - 2002-11-01 02:56:28
|
Matthew D. Hall wrote:
> The API should present the notion of a general edge-triggered event (e.g. I/O upon sockets, pipes, files, timers, etc.), and it should do so simply.

Agreed. Btw, earlier today I had misgivings about epoll, but since I've had such a positive response from Davide I think epoll has potential to become that ideal interface...

> * Unless every conceivable event is to be represented as a file (rather unintuitive IMHO), his proposed interface fails to accommodate non-I/O events (e.g. timers, signals, directory updates, etc.).

...apart from this one point!

> As much as I appreciate the UNIX Way, making everything a file is a massive oversimplification. We can often stretch the definition far enough to make this work, but I'd be impressed to see how one intends to call, e.g., a timer a type of file.

If it has an fd, that is, if it has an index into file_table, then it's a "file". No other semantics are required for event purposes. This seems quite weird and pointless at first, but actually fds are quite useful: you can dup them and pass them between processes, and they have a security model (you can't use someone else's fd unless they've passed it to you). Think of an fd as a handle to an arbitrary object.

OTOH look at rt-signals: a complete mess, no kernel allocation mechanism, libraries fight it out and don't always work. Look at aio: it has an aio_context_t - IMHO that should be an fd, not an opaque number that cannot be transferred to another process or polled on.

However, despite all the goodness of fds, you're right. Event queues really need to deliver more info than which event and read/write/hangup. dnotify events should include what happened and maybe the inode number. futex events should include the address. (rt-signals get close to this but fail due to pseudo-compatibility with a braindead POSIX standard).

> * There is a seemingly significant overhead in performing exactly one callback per event. Doesn't this prevent any kind of event coalescence?

As you go on to say, this should be a matter for userspace. My concern is that kernel space should provide a flexible mechanism for a variety of possible userspaces.

> * The interface should allow the implementation to be highly extensible without horrible code contortions within the kernel. It is important to be able to add new types of events as they become necessary. The interface should be general and simple enough to accommodate these extensions. Linux (really, UNIX) has failed to exercise this foresight in the past, and that's why we have the current mess of varied polling/triggering methods.

Agreed, agreed, agreed, agreed. Fwiw, I now think these can all be satisfied with some evolution of epoll, if that is permitted.

> * I might be getting a bit utopian here, but IMHO the kernel should move toward a completely edge-triggered event system. The old level-triggered interfaces should only wrap this paradigm.

Take a close look at how wait queues and notifier chains are used. Some of the kernel is edge triggered already. Admittedly, it's about as clear as clay at times - some wait queues are used in an edge-triggered way, others level-triggered.

> * Might as well reiterate: simplicity. FreeBSD's kevent solves nearly all of the traditional problems, but it is gross. And complicated.

Could you explain what you find complicated and/or gross about kevent? We should learn from their mistakes.

> There were clearly too many computer scientists and not enough engineers on that team.

I am both ;)

> * Only one queue per process or kernel thread. Multiple queues per flow of execution is just ugly and ultimately pointless.

Disagree. You're right that it's technically not necessary to have multiple queues, but in practice you can't always force an entire program to funnel all its I/O through one library - that just doesn't happen in reality. And there is basically no cost to having multiple queues. Keyed off fds :)

That was a mistake made by rt-signals: assuming all the different bits of code that use rt-signals will cooperate. Theoretically solvable in user space. In reality, they don't know about each other. Although my code at least searches for a free rt-signal, that's not guaranteed to work if another library assumes a fixed value. The same problem occurs with the LDT. Glibc wants to use it and so do I. Conflict.

> Since the event notification is edge-triggered, I cannot see any reason for a significant performance degradation resulting from this policy. I am not altogether convinced that this must be a strict policy, however; the issue of different userspace threads having different event queues inside the same task is interesting.

User space threads are often but not always built on top of a simple scheduler which converts blocking system calls to queued non-blocking system calls. If done well, this is a form of virtualisation which may even be nestable. You'd expect the event queue mechanism to be included in the set of blocking system calls which are converted, so multiple userspace threads would "see" individual queues even though they are multiplexed by the userspace scheduler. This works great, until those threads expect mmap() to provide them with their own separate zero-copy event queues :) So another reason to need multiple queues from the kernel.

> * No re-arming events. They must be manually killed.

I would provide both options, like dnotify does: one-shot and continuous. There are occasions when one-shot is better for resource usage - it depends what the event is monitoring.

> * I'm sure everyone would agree that passing an opaque "user context" pointer is necessary with each event.

It is not the end of the world to use an fd number and a table, but it may improve thread scalability to use a pointer instead.

> * Zero-copy event delivery (of course).

I think this is not critical for performance, but desirable anyway. I would take this further:

1. zero-copy delivery

2. zero system calls as long as the queue is non-empty (like the packet socket mmap interface)

3. no fixed limit on the size of the queue at creation time

Neither epoll nor aio satisfies (3). Luckily I have a nice design which satisfies all these and is extensible in the ways we agree on.

> Some question marks:
> - Should the kernel attempt to prune the queue of "cancelled" events (hints later deemed irrelevant, untrue, or obsolete by newer events)?

Something is needed in the case of aio cancellations, but I think that's different to what you mean. Backtracking hints is quite difficult to synchronise with userspace if done through mmap and no system calls. It's best not to bother.

Coalescing events, which can have the effect of superseding events in some cases, is a possibility. It's tricky but more worthwhile than backtracking. For some kinds of event, such as round robin futex wakeups, it's critically important that the _number_ of events seen by userspace is the same as the number sent from the kernel. In these cases, they are not just hints, they are synchronisation tokens. That adds some excitement to coalescing in a shared memory buffer.

-- Jamie |
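A sketch of the kind of mmap'ed queue points 1-3 describe. The layout is invented for illustration and is not epoll's or aio's ABI; handle_event() is an assumed callback, and memory barriers are omitted for brevity.

#include <stdint.h>

void handle_event(uint32_t source, uint32_t events);    /* assumed */

struct ev_ring {
        volatile uint32_t head;         /* advanced by the kernel */
        uint32_t tail;                  /* advanced by user space */
        uint32_t mask;                  /* ring size - 1, power of two */
        struct {
                uint32_t source;        /* which object fired */
                uint32_t events;        /* what happened */
        } slots[];
};

/* Drain events without entering the kernel; only when the ring is
 * empty does the caller fall back to a blocking wait syscall. */
static int drain_events(struct ev_ring *r)
{
        int n = 0;

        while (r->tail != r->head) {
                uint32_t i = r->tail & r->mask;

                handle_event(r->slots[i].source, r->slots[i].events);
                r->tail++;
                n++;
        }
        return n;
}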
From: Mark M. <ma...@ma...> - 2002-11-01 04:27:45
|
On Thu, Oct 31, 2002 at 11:02:15PM +0000, Jamie Lokier wrote:
> The semantics for this are a bit confusing and inconsistent with poll(). User gets POLL_RDNORM event which means something in the directory has changed, not that the directory is now readable or that poll() would return POLL_RDNORM. It really should be a different flag, made for the purpose.

Don't be encouraging any of us to expect the ability to poll() for changes to regular files (log file parsers that sit on EOF periodically polling for further data...). Just get *something* decent out so that we can play with it in a production environment. I would put off extensions such as this until the API is well established.

mark

-- 
ma...@mi.../ma...@nc.../ma...@no... | Neighbourhood Coder | Ottawa, Ontario, Canada
One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...
http://mark.mielke.cc/ |
From: Jamie L. <lk...@ta...> - 2002-11-01 04:59:37
|
Mark Mielke wrote:
> On Thu, Oct 31, 2002 at 11:02:15PM +0000, Jamie Lokier wrote:
> > The semantics for this are a bit confusing and inconsistent with poll(). User gets POLL_RDNORM event which means something in the directory has changed, not that the directory is now readable or that poll() would return POLL_RDNORM. It really should be a different flag, made for the purpose.
>
> Don't be encouraging any of us to expect the ability to poll() for changes to regular files (log file parsers that sit on EOF periodically polling for further data...).

Actually you can already do something similar, if a little coarse grained, in 2.4 kernels using dnotify on the parent directory.

> Just get *something* decent out so that we can play with it in a production environment. I would put off extensions such as this until the API is well established.

"something decent" is already out - epoll is quite useful in its present form. (Take that with a grain of salt - I haven't tried it, and it only just went into 2.5.45, and I have the impression Davide is cleaning up the code for 2.5.46 - but it looks basically ok).

-- Jamie |
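The dnotify trick Jamie mentions looks roughly like this in practice (a runnable sketch; the watched directory path is an assumption):

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t changed;

static void on_dnotify(int sig)
{
        changed = 1;
}

int main(void)
{
        int dirfd = open("/var/log", O_RDONLY);         /* assumed path */

        if (dirfd < 0)
                return 1;
        signal(SIGRTMIN, on_dnotify);
        fcntl(dirfd, F_SETSIG, SIGRTMIN);
        fcntl(dirfd, F_NOTIFY, DN_MODIFY | DN_CREATE | DN_MULTISHOT);

        for (;;) {
                pause();                /* wait for a notification */
                if (changed) {
                        changed = 0;
                        /* coarse grained: something in the directory
                         * changed; re-check the log file for new data */
                        printf("directory changed\n");
                }
        }
}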
From: John G. M. <jg...@ne...> - 2002-11-02 05:19:38
|
Matthew D. Hall wrote:
> * There is a seemingly significant overhead in performing exactly one callback per event.

The "exactly one callback per event" semantics of aio are important for cancellation in thread pool environments. When you're shutting down a connection, you need to be able to get to a point where you know no other thread is processing or will process an event for the connection, so it is safe to free the connection state.

> * Only one queue per process or kernel thread.

Having a single thread process multiple queues is not particularly interesting (unless you have user-space threads or coroutines). Being able to have different threads in the same process process different queues is interesting--it permits a library to set up its own queue, using its own threads to process it.

> * No re-arming events. They must be manually killed.

Rearming events is a useful way to get the correct cancellation semantics in thread pool environments.

> - Should the kernel attempt to prune the queue of "cancelled" events (hints later deemed irrelevant, untrue, or obsolete by newer events)?

This makes the cancellation semantics much easier to deal with in single threaded event loops. Single threaded cancellation is difficult in the current aio interface because in the case where the canceled operation already has an undelivered event in the queue, the canceling code has to defer freeing the context until it receives that event.

An additional point: In a thread pool environment, you want event wakeup to be in LIFO order and use wake-one semantics. You also want concurrency control: don't deliver an event to a waiting thread if that pool does not have fewer threads in runnable state than CPUs. |
From: Rusty R. <ru...@ru...> - 2002-10-31 23:52:18
|
On 31 Oct 2002 16:45:58 +0000 Alan Cox <al...@lx...> wrote:
> What is hard is multiple futex waits and livelock for that. I think it can be done properly but I've not sat down and designed it all out - I wonder what Rusty thinks.

Hmm... Never thought about it. You mean an API like:

        struct futex_set *futex_set_init();
        struct futex_set *futex_set_add(struct futex_set *, struct futex *);
        /* Returns futex obtained. */
        struct futex *futex_set_wait(struct futex_set *);

I think a naive implementation of futex_set_wait would look like:

        set = futex_set
        try:
                for each futex in set {
                        if (grab in userspace) {
                                close fds;
                                return with futex;
                        }
                        close old fd for futex if any
                        call FUTEX_FD to get fd notification of futex;
                }

                select on fds
                set = fds which are ready
                goto try

You could, of course, loop through the fast path once before making any syscalls. Another optimization is to have FUTEX_FD reuse an existing fd rather than requiring the close.

Not sure I get the point about livelock though: deadlock is possible if apps seek multiple locks at once without care, of course.

Rusty.

-- 
there are those who do and those who hang on and you don't see too many doers quoting their contemporaries. -- Larry McVoy |
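Rusty's pseudocode translates to roughly the following C (a sketch: it uses the FUTEX_FD operation of 2.5-era kernels, where the 'val' argument is a signal number and 0 means none; try_grab() stands in for the userspace atomic acquire and is an assumption):

#include <linux/futex.h>
#include <sys/select.h>
#include <sys/syscall.h>
#include <unistd.h>

int try_grab(int *uaddr);       /* userspace cmpxchg; assumed */

static int *futex_set_wait(int *uaddrs[], int n)
{
        for (;;) {
                fd_set rset;
                int fds[n];
                int i, got = -1, maxfd = -1;

                FD_ZERO(&rset);
                for (i = 0; i < n; i++) {
                        if (try_grab(uaddrs[i])) {
                                got = i;
                                break;
                        }
                        /* FUTEX_FD: returns an fd that becomes
                         * readable when the futex sees a FUTEX_WAKE */
                        fds[i] = syscall(SYS_futex, uaddrs[i],
                                         FUTEX_FD, 0, NULL);
                        FD_SET(fds[i], &rset);
                        if (fds[i] > maxfd)
                                maxfd = fds[i];
                }
                if (got < 0)
                        select(maxfd + 1, &rset, NULL, NULL, NULL);

                /* naive, as in the pseudocode above: close and
                 * re-register the fds on every pass */
                while (--i >= 0)
                        close(fds[i]);
                if (got >= 0)
                        return uaddrs[got];
        }
}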
From: Jamie L. <lk...@ta...> - 2002-11-01 00:32:59
|
Rusty Russell wrote:
> I think a naive implementation of futex_set_wait would look like:

Vaguely. We are looking for something with the queue-like semantics of epoll and rt-signals: persistent (as opposed to one-shot) listening, ordered delivery of events, scalable listening to thousands at once (without the poll/select O(n) problem).

> Not sure I get the point about livelock though: deadlock is possible if apps seek multiple locks at once without care, of course.

I'm not sure what Alan meant either.

-- Jamie |
From: Alan C. <al...@lx...> - 2002-11-01 13:03:15
|
On Thu, 2002-10-31 at 22:00, Rusty Russell wrote:
> try:
>         for each futex in set {
>                 if (grab in userspace) {
>                         close fds;
>                         return with futex;
>                 }
>                 close old fd for futex if any
>                 call FUTEX_FD to get fd notification of futex;
>         }
>
>         select on fds
>         set = fds which are ready
>         goto try
>
> You could, of course, loop through the fast path once before making any syscalls. Another optimization is to have FUTEX_FD reuse an existing fd rather than requiring the close.
>
> Not sure I get the point about livelock though: deadlock is possible if apps seek multiple locks at once without care, of course.

Think about 1000 futexes where one task wants to grab them all and other tasks want any one of them - done wrongly, at that point busy traffic will starve the 1000-futex waiter. |
From: Zach B. <za...@za...> - 2002-10-30 18:59:26
|
> It is very easy for me to remain calm here. You're a funny guy. You're in the computer science by many many years and still you're not able to understand how edge triggered events works. And look, this apply to every field, form ee to cs. Book suggestions would be requested here, but since I believe grasping inside a technical library to be pretty fun, I'll leave you this pleasure.

http://www.infidels.org/news/atheism/logic.html#hominem

I know it's hard, but can we try and avoid the most pathetic pitfalls of arguing over email?

- z |
From: Davide L. <da...@xm...> - 2002-10-30 19:15:51
|
On Wed, 30 Oct 2002, Zach Brown wrote:
> > It is very easy for me to remain calm here. You're a funny guy. You're in the computer science by many many years and still you're not able to understand how edge triggered events works. And look, this apply to every field, form ee to cs. Book suggestions would be requested here, but since I believe grasping inside a technical library to be pretty fun, I'll leave you this pleasure.
>
> http://www.infidels.org/news/atheism/logic.html#hominem
>
> I know its hard, but can we try and avoid the most pathetic pitfalls of arguing over email?

Zach, on one side it's very easy for me. I just won't reply. This should cut this very short. Looking at the whole thread you'll find that he wanted to pass off his disagreement with the interface, which is a pretty normal and legitimate thing, as a bug of the interface. Now, while disagreement implies a very subjective way of seeing a thing, a bug means a very objective thing. That is, "it does not work". Now, when someone states something that is proven to be false ( it's not even an RTQA, it's a NOT-A-BUG ), and when this someone has tried in every way to kill the interface ( for reasons that I'm not aware of ), and has also implied that "I do not understand", well, I've been educated to respond. Look, I'm a very simple guy. You don't touch me and I'll be transparent like a ghost for you. You touch me personally, and I retaliate.

- Davide |
From: Davide L. <da...@xm...> - 2002-10-31 16:44:59
|
On Wed, 30 Oct 2002, Zach Brown wrote:
> > It is very easy for me to remain calm here. You're a funny guy. You're in the computer science by many many years and still you're not able to understand how edge triggered events works. And look, this apply to every field, form ee to cs. Book suggestions would be requested here, but since I believe grasping inside a technical library to be pretty fun, I'll leave you this pleasure.
>
> http://www.infidels.org/news/atheism/logic.html#hominem
>
> I know its hard, but can we try and avoid the most pathetic pitfalls of arguing over email?

Zach, is it just me or did I receive this one twice :)

- Davide |
From: Shilpa <shi...@hs...> - 2007-03-21 11:20:17
|
I need sample code for do_use_fd(). It seems the EAGAIN logic has to be built into it by the application. |
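For what it's worth, a sketch of what such a do_use_fd() could look like (an illustration under stated assumptions, not official sample code): with edge-triggered epoll the application must indeed keep reading until read() reports EAGAIN, since no further event will arrive for data that has already been signalled.

#include <errno.h>
#include <unistd.h>

/* Sketch only: do_use_fd is the placeholder name from the epoll usage
 * example being asked about; the body below is an illustration. */
static void do_use_fd(int fd)
{
        char buf[4096];

        for (;;) {
                ssize_t n = read(fd, buf, sizeof(buf));

                if (n > 0)
                        continue;       /* consume the n bytes here */
                if (n == 0) {
                        close(fd);      /* peer closed the connection */
                        return;
                }
                if (errno == EAGAIN)    /* drained: wait for next edge */
                        return;
                close(fd);              /* real error */
                return;
        }
}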