From: Davide L. <da...@xm...> - 2002-10-29 20:54:06
|
On Tue, 29 Oct 2002, John Gardiner Myers wrote:
> > I bet Davide knows best.
>
> Nope, he doesn't.

It is very easy for me to remain calm here. You're a funny guy. You've been
in computer science for many, many years and still you're not able to
understand how edge-triggered events work. And look, this applies to every
field, from EE to CS. Book suggestions would be requested here, but since I
believe digging around a technical library to be pretty fun, I'll leave you
this pleasure.

> > An easy solution is to have sys_epoll_ctl check if there is data ready
> > and make sure there is an edge to report in that case to the next call
> > of sys_epoll_ctl().

This is the very solution I am proposing. This is an example snippet of code
that can be used with the current API:

	for (;;) {
		nfds = sys_epoll_wait(kdpfd, &pfds, -1);
		for (n = 0; n < nfds; ++n) {
			if ((fd = pfds[n].fd) == s) {
				client = accept(s, (struct sockaddr *) &local, &addrlen);
				if (client < 0) {
					perror("accept");
					continue;
				}
				if (sys_epoll_ctl(kdpfd, EP_CTL_ADD, client, POLLIN | POLLOUT) < 0) {
					fprintf(stderr, "sys_epoll set insertion error: fd=%d\n", client);
					return -1;
				}
				fd = client;
			}
			do_use_fd(fd);
		}
	}

This is what would be used in the case of your
failing-to-understand-edge-triggered-API method:

	for (;;) {
		nfds = sys_epoll_wait(kdpfd, &pfds, -1);
		for (n = 0; n < nfds; ++n) {
			if ((fd = pfds[n].fd) == s) {
				client = accept(s, (struct sockaddr *) &local, &addrlen);
				if (client < 0) {
					perror("accept");
					continue;
				}
				if (sys_epoll_ctl(kdpfd, EP_CTL_ADD, client, POLLIN | POLLOUT) < 0) {
					fprintf(stderr, "sys_epoll set insertion error: fd=%d\n", client);
					return -1;
				}
			} else
				do_use_fd(fd);
		}
	}

Why the heck (and this for the 100th time) do you want to go wait for an
event on the newly born fd if:

1) On connect() you have the _full_ write I/O space available
2) On accept() it's very likely that you'll find something more than a SYN
   in the first packet

Besides, the first code is even cleaner and more symmetric, while adopting
your half *ss solution might suggest to the user that he can go waiting for
events any time he wants. Like going to sleep on the wait queue of an IDE
disk w/out having issued any command. Now to bring this to 101, consider:

1) "issuing a command to an IDE disk" == "using read/write until EAGAIN"
2) "adding yourself on the IDE disk wait queue" == "calling sys_epoll_wait()"

PS: since my time is not infinite, and since I'm working on the changes we
agreed on with Andrew, I would suggest you either take another look at the
code, suggesting new changes to us (like you did yesterday), or go shopping
for books.

- Davide

|
From: Jamie L. <lk...@ta...> - 2002-10-30 00:26:55
|
> 1) "issuing a command to an IDE disk" == "using read/write until EAGAIN"
> 2) "adding yourself on the IDE disk wait queue" == "calling sys_epoll_wait()"

That is quite a good analogy. epoll is like a waitqueue - which is also
like a futex. To use a waitqueue properly you have to do these things in
the order shown:

1. Set the task state to stopped.
2. Register yourself on the waitqueue.
3. Check the condition.
4. If the condition is not met, schedule.

With epoll it is very similar. To wait for a condition on a file
descriptor, such as readability, you must do these things in the order
shown:

1. Register your interest using epoll_ctl.
2. Check the condition by actually calling read().
3. If the condition is not met (i.e. read() returned EAGAIN), call
   epoll_wait (i.e. the equivalent of schedule).

With epoll, you can optimise by registering interest just once. In other
words, steps 2 and 3 may be repeated without repeating step 1.

And if you are concerned about starvation -- that is, one of your file
descriptors always has new data so others don't get a chance to be
serviced -- don't be. You don't have to completely read one fd until you
see EAGAIN. All that matters is that until you see the EAGAIN, your user
space data structure should have a flag that says the fd is still
readable, so another epoll event is not expected or required for that fd.

-- Jamie

|
From: Davide L. <da...@xm...> - 2002-10-30 02:00:41
|
On Wed, 30 Oct 2002, Jamie Lokier wrote:
> That is quite a good analogy. epoll is like a waitqueue - which is
> also like a futex.
> [...]
> All that matters is that until you see the EAGAIN, your user space
> data structure should have a flag that says the fd is still readable,
> so another epoll event is not expected or required for that fd.

Jamie, can I pay you a beer? Your comment describes the API perfectly. You
can replace read() with write() in your description, and the whole thing is
still true.

- Davide

|
From: John G. M. <jg...@ne...> - 2002-10-30 02:22:42
|
Davide Libenzi wrote:
> You're in the computer science by many many years and still you're not
> able to understand how edge triggered events works.

Failure to agree does not imply failure to understand. I understand the
model you want to apply to this problem; I do not agree that it is the best
model to apply to this problem.

> Why the heck ( and this for the 100th time ) do you want to go to wait
> for an event on the newly born fd if :
>
> 1) On connect() you have the _full_ write I/O space available
> 2) On accept() it's very likely the you'll find something more than a SYN
>    in the first packet
>
> Besides, the first code is even more cleaner and simmetric, while
> adopting your half *ss solution might suggest the user that he can go
> waiting for events any time he wants.

The first code is hardly cleaner and is definitely not symmetric--the way
the accept code has to set up to fall through the do_use_fd() code is
subtle. In the first code, the accept segment cannot be cleanly pulled into
a callback:

	for (;;) {
		nfds = sys_epoll_wait(kdpfd, &pfds, -1);
		for (n = 0; n < nfds; ++n) {
			(cb[pfds[n].fd])(pfds[n].fd);
		}
	}

Also, your first code does not fit your "edge triggered" model--the code
for handling 's' does not drain its input. By the time you call accept(),
there could be multiple connections ready to be accepted.

Your connect() argument is not applicable to "server sends first"
protocols. I suspect you are being overly optimistic about the likelihood
of getting data with SYN, but whatever. The argument is basically that not
delivering an event upon registration (and thus having the event be
implicit) improves performance because the socket is going to be ready with
sufficiently high probability. I would counter that the cost of explicitly
delivering such an event is minuscule compared to the rest of the cost of
connection setup and teardown--the optimization is not worthwhile.

> Like going to sleep the the wait queue of IDE disk w/out having issued
> any command.

The key difference between this interface and wait queues is that with wait
queues it is not technically feasible to both register interest and test
the condition in a single, atomic operation. epoll does not have this
technical limitation, so it can provide a better interface.

> PS: since my time is not infinite, and since I'm working on the changes
> we agreed with Andrew I would suggest you either to take another look at
> the code suggesting us new changes ( like you did yesterday ) or to go
> shopping for books.

I am uncomfortable with the way the epoll code adds its own set of
notification hooks into the socket and pipe code. Much better would be to
extend the existing set of notification hooks, like the aio poll code does.
That would reduce the risk of kernel bugs where some subsystem delivers an
event to one but not all types of poll notification hooks, and it would
minimize the cost of the epoll patch when epoll is not being used.

|
From: Davide L. <da...@xm...> - 2002-10-30 03:42:27
|
On Tue, 29 Oct 2002, John Gardiner Myers wrote:
> Failure to agree does not imply failure to understand. I understand the
> model you want to apply to this problem, I do not agree that it is the
> best model to apply to this problem.

John, your first post about epoll was "the interface has a bug, please do
not merge it". Now either you have a strange way to communicate
disagreement, or it is something more than that. Or maybe you wanted to
just blindly kill the interface with your comments because you're totally
committed to another one currently, and the existence of an interface that
might work as well, or maybe better in some cases, could create some
problem for you that I'm unaware of.

> The first code is hardly cleaner and is definitely not symmetric--the
> way the accept code has to set up to fall through the do_use_fd() code
> is subtle. In the first code, the accept segment cannot be cleanly
> pulled into a callback:
>
> 	for (;;) {
> 		nfds = sys_epoll_wait(kdpfd, &pfds, -1);
> 		for (n = 0; n < nfds; ++n) {
> 			(cb[pfds[n].fd])(pfds[n].fd);
> 		}
> 	}

Sorry, what prevents you from coding that? If you, instead of ranting
because epoll does not fit your personal idea of event notification, took a
look at the example http server used for the test (coroutine based), you'd
see that it does exactly that. Ok, it's a mess because it supports 5
interfaces, all #ifdef'ed, but the concept is there.

> Also, your first code does not fit your "edge triggered" model--the
> code for handling 's' does not drain its input. By the time you call
> accept(), there could be multiple connections ready to be accepted.

I really don't believe this. Are you just trolling or what? It is clear
that your acceptor routine has to do a little more work than that in a real
program. Again, looking at the example http server might help you. This is
what the acceptor coroutine does in such a _trivial_ http server:

	static void *dph_acceptor(void *data)
	{
		struct dph_conn *conn = (struct dph_conn *) data;
		struct sockaddr_in addr;
		int sfd, addrlen = sizeof(addr);

		while ((sfd = dph_accept(conn, (struct sockaddr *) &addr, &addrlen)) != -1) {
			if (dph_new_conn(sfd, dph_httpd) < 0) {
				dph_close(sfd);
			}
		}
		return data;
	}

and this is dph_accept:

	int dph_accept(struct dph_conn *conn, struct sockaddr *addr, int *addrlen)
	{
		int sfd, flags = 1;

		while ((sfd = accept(conn->sfd, addr, (socklen_t *) addrlen)) < 0) {
			if (errno == EINTR)
				continue;
			if (errno != EAGAIN && errno != EWOULDBLOCK)
				return -1;
			conn->events = POLLIN;
			co_resume(conn);
		}
		if (ioctl(sfd, FIONBIO, &flags) &&
		    ((flags = fcntl(sfd, F_GETFL, 0)) < 0 ||
		     fcntl(sfd, F_SETFL, flags | O_NONBLOCK) < 0)) {
			close(sfd);
			return -1;
		}
		return sfd;
	}

and this is dph_new_conn:

	static int dph_new_conn(int sfd, void *func)
	{
		struct dph_conn *conn = (struct dph_conn *) malloc(sizeof(struct dph_conn));

		if (!conn)
			return -1;
		DBL_INIT_LIST_HEAD(&conn->lnk);
		conn->sfd = sfd;
		conn->events = POLLIN | POLLOUT;
		conn->revents = 0;
		if (!(conn->co = co_create(func, NULL, stksize))) {
			free(conn);
			return -1;
		}
		DBL_LIST_ADDT(&conn->lnk, &chash[sfd % chash_size]);
		if (epoll_ctl(kdpfd, EP_CTL_ADD, sfd, POLLIN | POLLOUT) < 0) {
			DBL_LIST_DEL(&conn->lnk);
			co_delete(conn->co);
			free(conn);
			return -1;
		}
		co_call(conn->co, conn);
		return 0;
	}

Oh ... I forgot the scheduler:

	static int dph_scheduler(int loop, unsigned int timeout)
	{
		int ii, nfds;
		struct dph_conn *conn;
		struct pollfd const *pfds;

		do {
			nfds = sys_epoll_wait(kdpfd, &pfds, timeout * 1000);
			for (ii = 0; ii < nfds; ii++, pfds++) {
				if ((conn = dph_find(pfds->fd))) {
					conn->revents = pfds->revents;
					if (conn->revents & conn->events)
						co_call(conn->co, conn);
				}
			}
		} while (loop);
		return 0;
	}

And just to make it complete, these are read/write:

	int dph_read(struct dph_conn *conn, char *buf, int nbyte)
	{
		int n;

		while ((n = read(conn->sfd, buf, nbyte)) < 0) {
			if (errno == EINTR)
				continue;
			if (errno != EAGAIN && errno != EWOULDBLOCK)
				return -1;
			conn->events = POLLIN;
			co_resume(conn);
		}
		return n;
	}

	int dph_write(struct dph_conn *conn, char const *buf, int nbyte)
	{
		int n;

		while ((n = write(conn->sfd, buf, nbyte)) < 0) {
			if (errno == EINTR)
				continue;
			if (errno != EAGAIN && errno != EWOULDBLOCK)
				return -1;
			conn->events = POLLOUT;
			co_resume(conn);
		}
		return n;
	}

The functions co_resume() and co_call() are the coroutine suspend and call.
The one I'm using is this:

	http://www.goron.de/~froese/coro/

but coroutine implementation is trivial. You could change the same
implementation to use an I/O driven state machine and the result would not
change.

> I am uncomfortable with the way the epoll code adds its own set of
> notification hooks into the socket and pipe code. Much better would be
> to extend the existing set of notification hooks, like the aio poll
> code does. That would reduce the risk of kernel bugs where some
> subsystem fails to deliver an event to one but not all types of poll
> notification hooks and it would minimize the cost of the epoll patch
> when epoll is not being used.

Doh! John, did you actually read the code? Could you compare AIO's level of
intrusion into the kernel code with epoll's? It fits _exactly_ the
rt-signal hooks. One of the design goals for me was to add almost nothing
on the main path. You can look here for a quick comparison between aio poll
and epoll for a test where event delivery efficiency does matter (pipetest):

	http://lse.sourceforge.net/epoll/index.html

Now, I don't believe that a real world app will exchange 300000 tokens per
second through a pipe, but this helps you understand the efficiency of the
epoll event notification subsystem.

- Davide

|
From: John G. M. <jg...@ne...> - 2002-10-31 02:08:11
|
Davide Libenzi wrote:
> John, your first post about epoll was "the interface has a bug, please
> do not merge it".

My first post about epoll pointed out how it was designed for single
threaded callers and concluded:

    I certainly hope /dev/epoll itself doesn't get accepted into the
    kernel, the interface is error prone. Registering interest in a
    condition when the condition is already true should immediately
    generate an event, the epoll interface did not do that last time I
    saw it discussed. This deficiency in the interface requires callers
    to include more complex workaround code and is likely to result in
    subtle, hard to diagnose bugs.

I did not say "the interface has a bug", I said that the interface is error
prone. This is a deficiency that should be fixed before the interface is
added to the kernel.

> Sorry, what prevents you in coding that ? If you, instead of ranting
> because epoll does not fit your personal idea of event notification,
> took a look to the example http server used for the test ( coroutine
> based ) you'll see that does exactly that.

You posted code which you claimed was "even more cleaner and simmetric"
(sic) because it fell through to the do_use_fd() code instead of putting
the do_use_fd() code in an else clause. A callback scheme is akin to the
if/else structure. To adapt the first code to a callback scheme, the accept
callback has to somehow arrange to call the do_use_fd() callback before
returning to the event loop. This requirement is subtle and asymmetric.

> I really don't believe this. Are you just trolling or what ? It is
> clear that your acceptor routine has to do a little more work than that
> in a real program.

Basically, you spawn off another coroutine. That complicates the "fall
through to do_use_fd()" logic in the first code by requiring an external
facility not required by the second code. The second code could simply have
the accept code loop until EAGAIN.

> Doh ! John, did you actually read the code ?

Yes, indeed.

> Could you compare AIO level of intrusion inside the kernel code with
> the epoll one ?

Aio poll extends the existing set of poll notification hooks with a
callback mechanism. It then plugs into this callback mechanism in order to
deliver events. The end result is that the same notification hooks are used
for classic poll and aio poll. When aio poll is not being used, there is no
additional performance penalty other than a slightly larger
poll_table_entry and poll_table_page.

Epoll creates a new callback mechanism and plugs into this new callback
mechanism. It adds a new set of notification hooks which feed into this new
callback mechanism. The end result is that there is one set of notification
hooks for classic poll and another set for epoll. When epoll is not being
used, the poll and socket code makes an additional set of checks to see
that nobody has registered interest through the new callback mechanism.

> It fits _exactly_ the rt-signal hooks. One of the design goals for me
> was to add almost nothing on the main path. You can lookup here for a
> quick compare between aio poll and epoll for a test where events
> delivery efficency does matter ( pipetest ) :

This is a comparison of the cost of using epoll to the cost of using aio in
one particular situation. It is irrelevant to the point I was making.

> Now, I don't believe that a real world app will exchange 300000 tokens
> per second through a pipe, but this help you to understand the efficency
> of the epoll event notification subsystem.

My understanding of the efficiency of the epoll event notification
subsystem is:

1) Unlike the current aio poll, it amortizes the cost of interest
   registration/deregistration across multiple events for a given
   connection.
2) It declares multithreaded use out of scope, making optimizations that
   are only appropriate for use by single threaded callers.

|
From: Davide L. <da...@xm...> - 2002-10-31 03:11:56
|
On Wed, 30 Oct 2002, John Gardiner Myers wrote:
> You posted code which you claimed was "even more cleaner and simmetric"
> (sic) because it fell through to the do_use_fd() code instead of
> putting the do_use_fd() code in an else clause. A callback scheme is
> akin to the if/else structure. To adapt the first code to a callback
> scheme, the accept callback has to somehow arrange to call the
> do_use_fd() callback before returning to the event loop. This
> requirement is subtle and asymmetric.

A callback scheme can be _trivially_ implemented using the current epoll.
I'm sure you know exactly how to do it, so I'm not spending more time
explaining it to you.

> Basically, you spawn off another coroutine. That complicates the "fall
> through to do_use_fd()" logic in the first code by requiring an
> external facility not required by the second code. The second code
> could simply have the accept code loop until EAGAIN.

No it does not, you always fall through do_use_fd(). It's that simple.

> Epoll creates a new callback mechanism and plugs into this new callback
> mechanism. It adds a new set of notification hooks which feed into this
> new callback mechanism. The end result is that there is one set of
> notification hooks for classic poll and another set for epoll. When
> epoll is not being used, the poll and socket code makes an additional
> set of checks to see that nobody has registered interest through the
> new callback mechanism.

Whereas the epoll hooks have nothing to do with ->f_op->poll().

> This is a comparison of the cost of using epoll to the cost of using
> aio in one particular situation. It is irrelevant to the point I was
> making.

See, I believe numbers talk. And they make a pretty clear point indeed.

> My understanding of the efficiency of the epoll event notification
> subsystem is:
>
> 1) Unlike the current aio poll, it amortizes the cost of interest
>    registration/deregistration across multiple events for a given
>    connection.

Yep.

> 2) It declares multithreaded use out of scope, making optimizations
>    that are only appropriate for use by single threaded callers.

It's not single threaded. It can be used in a multithreaded environment if
the one who codes the app has a minimal idea of what he's doing. Like
everything else. You cannot use a FILE* wildly, sharing it randomly inside
a multithreaded app, and expect to receive coherent results. Like 95% of
the APIs. Can those APIs be used in a multithreaded environment? You bet,
with care, like everything that uses freakin' threads.

- Davide

|
From: Suparna B. <su...@in...> - 2002-10-31 11:08:30
|
On Wed, Oct 30, 2002 at 07:21:24PM -0800, Davide Libenzi wrote:
> On Wed, 30 Oct 2002, John Gardiner Myers wrote:
> > Epoll creates a new callback mechanism and plugs into this new
> > callback mechanism. It adds a new set of notification hooks which
> > feed into this new callback mechanism. [...] When epoll is not being
> > used, the poll and socket code makes an additional set of checks to
> > see that nobody has registered interest through the new callback
> > mechanism.
>
> Where epoll hooks has nothing to do with ->f_op->poll()

I think what John means, and what Jamie has also brought up in a separate
note, is that now when an event happens on an fd, in some cases there are
tests for 3 kinds of callbacks that get triggered -- the wait queue for
poll type registrations, the fasync list for sigio, and the new epoll file
send notify type callbacks. There is a little overhead (not sure if
significant) for each kind of test ...

> > 1) Unlike the current aio poll, it amortizes the cost of interest
> >    registration/deregistration across multiple events for a given
> >    connection.
>
> Yep

Adding persistent iocb support to aio doesn't appear too hard, and to be
fair to aio, it does seem to help it come much closer to epoll, in fact
very much closer at least for pipetest with a quickly hacked version that I
tried. There still appears to be a gap remaining to be covered, i.e. epoll
continuing to lead :) albeit by a smaller margin.

A little more magic is going on than just interest registration
amortization (and I suspect it's not just the threading argument), worth
analysing if not for any other reason but to gain a better understanding of
these 2 event delivery mechanisms, the core for both of which are now in
the mainline kernel.

Regards
Suparna

-- 
Suparna Bhattacharya (su...@in...)
Linux Technology Center
IBM Software Labs, India

|
From: Davide L. <da...@xm...> - 2002-10-31 18:32:55
|
On Thu, 31 Oct 2002, Suparna Bhattacharya wrote:
> I think what John means, and what Jamie has also brought up in a
> separate note is that now when an event happens on an fd, in some cases
> there are tests for 3 kinds of callbacks that get triggered -- the wait
> queue for poll type registrations, the fasync list for sigio, and the
> new epoll file send notify type callbacks. There is a little overhead
> (not sure if significant) for each kind of test ...

The poll hooks are not where an edge triggered event notification API
wants to hook. Given the way notifications are sent and the registration
method, that is not the most efficient thing. Hooking inside the fasync
list is worth investigating and I'll look into it as soon as I've finished
the patch for 2.5.45 for Linus. It does have certain limits IMHO, like the
single lock protection. I'll look into it, even if the famous cost of the
extra callback check cannot even be measured IMHO.

- Davide

|
From: Jamie L. <lk...@ta...> - 2002-10-30 23:02:09
|
John Gardiner Myers wrote:
> I am uncomfortable with the way the epoll code adds its own set of
> notification hooks into the socket and pipe code. Much better would be
> to extend the existing set of notification hooks, like the aio poll
> code does.

Fwiw, I agree with the above (I'm having a think about it).

I also agree with criticisms that epoll should test and send an event on
registration, but only _if_ the test is cheap. Nothing to do with
correctness (I like the edge semantics as they are), but because delivering
one event is so infinitesimally low impact with epoll that it's preferable
to doing a single speculative read/write/whatever.

Regarding the effectiveness of the optimisation, I'd guess that quite a lot
of incoming connections do not come with initial data in the short
scheduling time after a SYN (unless it's on a LAN). I don't know this for
sure though.

-- Jamie

|
From: Davide L. <da...@xm...> - 2002-10-30 23:44:14
|
On Wed, 30 Oct 2002, Jamie Lokier wrote:
> I also agree with criticisms that epoll should test and send an event
> on registration, but only _if_ the test is cheap. Nothing to do with
> correctness (I like the edge semantics as they are), but because
> delivering one event is so infinitesimally low impact with epoll that
> it's preferable to doing a single speculative read/write/whatever.
>
> Regarding the effectiveness of the optimisation, I'd guess that quite
> a lot of incoming connections do not come with initial data in the
> short scheduling time after a SYN (unless it's on a LAN). I don't
> know this for sure though.

Ok Jamie, try to explain to me what kind of improvement this first drop
would bring. And also, how such a first drop would not bring "confusion"
for the user, letting him think that he can go sleeping even w/out having
first received EAGAIN. Isn't it better to say "you wait for events after
EAGAIN", instead of "you wait for events after EAGAIN, but also after
accept/connect"? The cost of the test will be basically the cost of a
->poll(), that is exactly the same cost as the very first read()/write()
that you would do by following the current API rule.

- Davide

|
From: Jamie L. <lk...@ta...> - 2002-10-31 00:53:07
|
Davide Libenzi wrote:
> The cost of the test will be basically the cost of a ->poll(), that is
> exactly the same cost of the very first read()/write() that you would
> do by following the current API rule.

No, the cost of ->poll() is somewhat less than read()/write(), because the
latter requires a system call and the former does not. System calls are
still nowhere near as cheap as function calls.

> Ok Jamie, try to explain me which kind of improvement this first drop
> will bring.

I have thought about an optimal server state machine. (I presume from your
carefully thought out implementation that you have too).

In a state machine, each fd has some user-space state. I've already hinted
at how this is used to prevent starvation/livelock on a busy server, and
make service fairer.

I would take that further and _defer_ the epoll_ctl() to register an fd
until the first time I have seen EAGAIN from that fd. This is because in
some cases, epoll_ctl() would not be needed at all - so we can remove that
overhead, and the system call overhead.

Now you would force me to call read() or write() after the epoll_ctl(),
even though I _know_ the result is always going to be EAGAIN. You're
forcing me to make an always-redundant system call. But I can't omit it,
because that's a race condition.

So, I've thought about the _optimal_ state machine and it's clear that
epoll should test the condition on fd registration - for best performance.
(Nothing to do with scalability, just raw performance).

> And also, how such first drop would not bring a "confusion" for the
> user, letting him think that he can go sleeping event w/out having
> first received EAGAIN. Isn't it better to say "you wait for events
> after EAGAIN", instead of "you wait for events after EAGAIN but after
> accept/connect".

Be careful with your rules. epoll should work with blocking fds too, if
you understand the rules well enough, and fd registration doesn't have to
be done at the same time as accept/connect/pipe.

Your current rule in practice is:

    an event is generated on every "would-block" -> "ready" transition.
    after fd registration, you must treat the fd as "ready".

The proposed rule is this:

    an event is generated on every "would-block" -> "ready" transition.
    after fd registration, you may treat the fd as in any state you like.

The proposed rule is better because it permits better optimisations in
user space, as explained earlier. (If you _really_ want to avoid the call
to ->poll() when user space doesn't care, make that a flag argument to
epoll_ctl()).

enjoy :)
-- Jamie

|
From: Davide L. <da...@xm...> - 2002-10-31 04:05:40
|
On Thu, 31 Oct 2002, Jamie Lokier wrote: > No, the cost of ->poll() is somewhat less than read()/write(), because > the latter requires a system call and the former does not. System > calls are still nowhere near as cheap as function calls. Jamie, it's not for the cost, it's that IMHO is useless. And might generate confusion on the API usage. > I have thought about an optimal server state machine. (I presume from > your carefully thought out implementation that you have too). > > In a state machine, each fd has some user-space state. I've already > hinted at how this is used to prevent starvation/livelock on a busy > server, and make service fairer. > > I would take that further and _defer_ the epoll_ctl() to register an > fd until the first time I have seen EAGAIN from that fd. This is > because in some cases, epoll_ctl() would not be needed at all - so we > can remove that overhead, and the system call overhead. > > Now you would force me to call read() or write() after the > epoll_ctl(), even though I _know_ the result is always going to be > EAGAIN. You're forcing me to make an always redundant system call. > But I can't omit it, because that's a race condition. > > So, I've thought about the _optimal_ state machine and it's clear that > epoll should test the condition on fd registration - for best > performance. (Nothing to do with scalability, just raw performance). Jamie I don't force you to call read/write soon. Your state machine will have a state 0, from where everything starts. Let's say that this server is an SMTP server and that supports PIPELINING. When a client connect ( you accept ) you will basically have your acceptor routine that puts the fd for the new connection inside your list of ready-fds. Such list will contain connection status, state machine state and a callback at the bare bone. The whenever you feel it appropriate you pop the fd from the ready list and you call the associated callback. 
That callback for state 0 will have encoded "send SMTP welcome message" to the client. The socket write buffer will be empty and your write() will return != EAGAIN. So you keep your fd inside your ready list. Having a ready list enables you to handle priorities, fairness, etc... Having successfully sent the welcome string will move you to the next state, state 1. Whenever you find it appropriate, you'll call again the callback associated with the file descriptor, which for state 1 will have encoded "read SMTP command". Now suppose that the SMTP client is lazy and you have nothing in the input buffer ( or you partially read the SMTP command ). The read() will return EAGAIN, you remain in state 1 and you remove the fd from your ready list. This guy is _ready_ to generate an event. One of the next times you'll call epoll_wait(2) you'll find our famous fd ready to be used. You push it into the ready list, and it's up to you, based on your fairness policies, to use it soon or not. <b> The important thing is that you keep it in your ready list and you do not go wait for it </b> Now the PIPELINING stuff makes it worthwhile to have your ready-fds list, to apply fairness rules among your clients. The above pattern repeats by moving your state machine among your states until, finally, you reach the final state where you drop the connection. Now, this one, which is a typical state machine implementation, can be _trivially_ implemented with epoll, and I don't see how adding an initial event might help in this design. The other, even more trivial, implementation using coroutines shows its simplicity in a pretty clear way. > Be careful with your rules. epoll should work with blocking fds too, > if you understand the rules well enough, and fd registration doesn't > have to be done at the same time as accept/connect/pipe. Obviously you can register the fd whenever you want. I would take _a_lot_ of care using it with blocking files. 
Not because it will crash or something like that, but because you might stall your app on a read/write operation. Suppose you received your event, and you have 2000 bytes in your input buffer for example. You start reading the data with a blocking file and when the data runs out you'll be waiting on that system call, which is definitely not what you want to do in a 1:N ( one task, N files ) application architecture. You don't really want to use blocking files with an edge triggered event API. > Your current rule in practice is: > > an event is generated on every "would-block" -> "ready" transition. > after fd registration, you must treat the fd as "ready". > > The proposed rule is this: > > an event is generated on every "would-block" -> "ready" transition. > after fd registration, you may treat the fd as in any state you like. > > The proposed rule is better because it permits better optimisations in > user space, as explained earlier. (If you _really_ want to avoid the > call to ->poll() when user space doesn't care, make that a flag > argument to epoll_ctl()). I still prefer 1) Jamie, besides the system call cost ( which is not always a cost, see soon-ready ops ), there's the fact of making the user follow a behavior pattern, something that point 2) leaves uncertain. Now, I guess that we will spend a lot of time arguing and talking about nothing. Let's go to the code. Show me with real code ( possibly not 25000 lines :) ) where you get stuck w/out having the initial event, and if it makes sense and there's no clean way to solve it in user space, I'll seriously consider your ( and John's ) proposal. - Davide |
From: Jamie L. <lk...@ta...> - 2002-10-31 15:07:41
|
Davide Libenzi wrote: > [long description of ready lists] Davide, you have exactly explained ready lists, which I assume we'd already agreed on (cf. beer), and completely missed the point about deferring the call to epoll_ctl(). You haven't mentioned that at all in your description. Consider a caching HTTP proxy server. Let's say 25% of the requests to a proxy server are _not_ using pipelining. Then you can save 25% of the calls to epoll_ctl() if network conditions are favourable. > Jamie, I don't force you to call read/write soon. You do if I try to optimise by deferring the call to epoll_ctl(). Let's see how my user space optimisation is affected in your description. > Your state machine will > have a state 0, from where everything starts. Let's say that this server > is an SMTP server that supports PIPELINING. When a client connects ( > you accept ) you will basically have your acceptor routine that puts the > fd for the new connection inside your list of ready-fds. Such a list will > contain connection status, state machine state and a callback at the bare > minimum. Then whenever you feel it appropriate you pop the fd from the ready > list and you call the associated callback. That callback for state 0 will have > encoded "send SMTP welcome message" to the client. The socket write buffer > will be empty and your write() will return != EAGAIN. So you keep your fd > inside your ready list. Having a ready list enables you to handle > priorities, fairness, etc... Having successfully sent the welcome string > will move you to the next state, state 1. Whenever you find it > appropriate, you'll call again the callback associated with the file > descriptor, which for state 1 will have encoded "read SMTP command". Now > suppose that the SMTP client is lazy and you have nothing in the input > buffer ( or you partially read the SMTP command ). The read() will return > EAGAIN, you remain in state 1 and you remove the fd from your ready list. At this point, I would call epoll_ctl(). 
Note, I do _not_ call epoll_ctl() after accept(), because that is a waste of time. It is better to defer it because sometimes it is not needed at all. > This guy is _ready_ to generate an event. One of the next times you'll call > epoll_wait(2) you'll find our famous fd ready to be used. Most of the time this works, but there's a race condition: after I saw EAGAIN and before I called epoll_ctl(), the state might have changed. So I must call read() after epoll_ctl(), even though it is 99.99% likely to return EAGAIN. Here is the system call sequence that I end up executing: - read() = nbytes - read() = EAGAIN - epoll_ctl() // Called first time we see EAGAIN, // if we will want to read more. - read() = EAGAIN The final read is 99.99% likely to return EAGAIN, and could be eliminated if epoll_ctl() had an option to test the condition. > Now, this one, which is a typical state machine implementation, > can be _trivially_ implemented with epoll, and I don't see how adding an > initial event might help in this design. The other, even more trivial, > implementation using coroutines shows its simplicity in a pretty clear > way. That's because it doesn't help in that design. It helps in a different (faster in some scenarios ;) design. > You don't really want to use blocking files with > an edge triggered event API. Agreed, it is rarely useful, but I felt your description ("event after EAGAIN") was technically incorrect. Of course it would be ok to _specify_ the API as only giving defined behaviour for non-blocking I/O. > > > > an event is generated on every "would-block" -> "ready" transition. > > after fd registration, you must treat the fd as "ready". > > > > The proposed rule is this: > > > > an event is generated on every "would-block" -> "ready" transition. > > after fd registration, you may treat the fd as in any state you like. > > > > The proposed rule is better because it permits better optimisations in > > user space, as explained earlier. 
(If you _really_ want to avoid the > > call to ->poll() when user space doesn't care, make that a flag > > argument to epoll_ctl()). > > I still prefer 1) Jamie, besides the system call cost ( that is not always > a cost, see soon-ready ops ), Do you avoid the cost of epoll_ctl() per new fd? > there's the fact of making the user follow a behavior > pattern, which point 2) leaves uncertain. That's the point :-) Flexibility is _good_. It means that somebody can implement a technique that you haven't thought of. With 2) the _programmer_ (let's assume some level of understanding) can use the exact application code that you offered. It will work fine. However, if they're feeling clever, like me, they can optimise further. I would suggest, though, to simply provide both options: EP_CTL_ADD and EP_CTL_ADD_AND_TEST. That's so explicit that nobody can be confused! > Show me with real code ( possibly not 25000 lines :) ) > where you get stuck w/out having the initial event and if it makes sense > and there's no clean way to solve it in user space, I'll seriously > consider your ( and John's ) proposal. Davide, I don't write buggy code deliberately. My code would not get stuck using present epoll. Neither API 1) nor 2) has a bug, but version 1) is slower with some kinds of state machine. (Don't confuse me with the person who said your API is buggy -- it is _not_ buggy, it's just not as flexible as it should be). I can write code that shows the optimisation if that would make it clearer. -- Jamie |
From: Davide L. <da...@xm...> - 2002-10-31 19:00:35
|
On Thu, 31 Oct 2002, Jamie Lokier wrote: > Davide, you have exactly explained ready lists, which I assume we'd > already agreed on (cf. beer), and completely missed the point about > deferring the call to epoll_ctl(). You haven't mentioned that at all in > your description. > > Consider a caching HTTP proxy server. Let's say 25% of the requests to > a proxy server are _not_ using pipelining. Then you can save 25% of > the calls to epoll_ctl() if network conditions are favourable. > > You do if I try to optimise by deferring the call to epoll_ctl(). > Let's see how my user space optimisation is affected in your description. > > At this point, I would call epoll_ctl(). Note, I do _not_ call > epoll_ctl() after accept(), because that is a waste of time. It is > better to defer it because sometimes it is not needed at all. > > > This guy is _ready_ to generate an event. One of the next times you'll call > > epoll_wait(2) you'll find our famous fd ready to be used. > > Most of the time this works, but there's a race condition: after I saw > EAGAIN and before I called epoll_ctl(), the state might have changed. > So I must call read() after epoll_ctl(), even though it is 99.99% > likely to return EAGAIN. > > Here is the system call sequence that I end up executing: > > - read() = nbytes > - read() = EAGAIN > - epoll_ctl() // Called first time we see EAGAIN, > // if we will want to read more. > - read() = EAGAIN > > The final read is 99.99% likely to return EAGAIN, and could be > eliminated if epoll_ctl() had an option to test the condition. > > Do you avoid the cost of epoll_ctl() per new fd? Jamie, the cost of epoll_ctl(2) is minimal/zero compared with the average life of a connection. Also it might be done once at fd "creation" and once at fd "removal". It's not inside a high-frequency loop like epoll_wait(2). Believe me, or ... do not believe me and show me a little performance data that shows this performance degradation due to a soonish epoll_ctl(2). 
And Jamie, if you really really want to use such a pattern ( delaying the fd registration, which IMHO does not help you in getting any performance boost ) you can still do it in user space ( poll() timeout 0 after epoll_ctl(2) ). Jamie, I'm _really_ willing to be contradicted with performance data here. > I would suggest, though, to simply provide both options: EP_CTL_ADD > and EP_CTL_ADD_AND_TEST. That's so explicit that nobody can be > confused! The EP_CTL_ADD_AND_TEST would do the poll() timeout 0 trick in kernel space. Is it faster done in the kernel? Sure, you can measure something at rates of 500000 registrations per second. Even here Jamie, I'm willing to be contradicted by performance data. You're trying to optimize something that is not inside a high-frequency loop, and it's not going to give you any measurable improvement IMHO. - Davide |
From: Dan K. <da...@ke...> - 2002-11-01 17:26:18
|
Davide Libenzi wrote: >>Do you avoid the cost of epoll_ctl() per new fd? > > Jamie, the cost of epoll_ctl(2) is minimal/zero compared with the average > life of a connection. Depends on the workload. Where I work, the http client I'm writing has to perform extremely well even on 1-byte files with HTTP 1.0. Minimizing system calls is surprisingly important - even a gettimeofday hurts. - Dan |
From: Davide L. <da...@xm...> - 2002-11-01 17:36:03
|
On Fri, 1 Nov 2002, Dan Kegel wrote: > Davide Libenzi wrote: > >>Do you avoid the cost of epoll_ctl() per new fd? > > > > Jamie, the cost of epoll_ctl(2) is minimal/zero compared with the average > > life of a connection. > > Depends on the workload. Where I work, the http client I'm writing > has to perform extremely well even on 1-byte files with HTTP 1.0. > Minimizing system calls is surprisingly important - even > a gettimeofday hurts. Dan, is it _one_ gettimeofday() or a gettimeofday() inside a loop ? gettimeofday() is of the order of a few microseconds ... and if your client works with anything else than a loopback, a few microseconds shouldn't weigh in much compared to connect/send/recv/close on a network connection. It is not so much the fact that you transfer one byte, it's the whole TCP handshake cost that weighs in. - Davide |
From: Dan K. <da...@ke...> - 2002-11-01 18:25:32
|
Davide Libenzi wrote: > On Fri, 1 Nov 2002, Dan Kegel wrote: > >>Davide Libenzi wrote: >> >>>>Do you avoid the cost of epoll_ctl() per new fd? >>> >>>Jamie, the cost of epoll_ctl(2) is minimal/zero compared with the average >>>life of a connection. >> >>Depends on the workload. Where I work, the http client I'm writing >>has to perform extremely well even on 1-byte files with HTTP 1.0. >>Minimizing system calls is surprisingly important - even >>a gettimeofday hurts. > > Dan, is it _one_ gettimeofday() or a gettimeofday() inside a loop ? > gettimeofday() is of the order of a few microseconds ... and if your client > works with anything else than a loopback, a few microseconds shouldn't weigh > in much compared to connect/send/recv/close on a network connection. It is > not so much the fact that you transfer one byte, it's the whole TCP handshake > cost that weighs in. The scenario is: we're doing load testing of http products, and for various reasons, we want line-rate traffic with the smallest possible message size. i.e. we want the maximum number of HTTP requests/responses per second. Hence the 1-byte payloads. A single system call on the slowish embedded processor I'm using has a surprisingly large impact on the number of http gets per second I can do. A 1% increase in speed is worth it for me! So please do try to reduce the number of syscalls needed to handle very short TCP sessions, if possible. - Dan |
From: Jamie L. <lk...@ta...> - 2002-11-01 19:17:04
|
Dan Kegel wrote: > Davide Libenzi wrote: > >>Do you avoid the cost of epoll_ctl() per new fd? > > > >Jamie, the cost of epoll_ctl(2) is minimal/zero compared with the average > >life of a connection. > > Depends on the workload. Where I work, the http client I'm writing > has to perform extremely well even on 1-byte files with HTTP 1.0. > Minimizing system calls is surprisingly important - even > a gettimeofday hurts. For this sort of thing, I would like to see an option to automatically set the non-blocking flag on accept(). To really squeeze the system calls, you could also automatically epoll-register on accept(), and for a super bonus automatically do the accept() at event delivery time. But it's getting very silly at that point. -- Jamie |
From: Charlie K. <kr...@ac...> - 2002-11-01 20:05:11
|
Jamie Lokier <lk...@ta...> writes: > For this sort of thing, I would like to see an option to automatically > set the non-blocking flag on accept(). To really squeeze the system > calls, you could also automatically epoll-register on accept(), and > for a super bonus automatically do the accept() at event delivery time. > But it's getting very silly at that point. > -- Jamie I would like to see a new kind of nonblocking flag that implies the use of epoll. So instead of giving O_NONBLOCK to fcntl(F_SETFL), you give O_NONBLOCK_EPOLL. In addition to becoming non-blocking, the socket is added to the epoll interest set. Furthermore, if the socket is a "listener" socket, all connections accepted on the socket inherit the non-blocking status and are added automatically to the same epoll interest set. It's true that this can get silly though. I'd like to do the same with other flags, like TCP_CORK. -- Buck > -- > To unsubscribe, send a message with 'unsubscribe linux-aio' in > the body to maj...@kv.... For more info on Linux AIO, > see: http://www.kvack.org/aio/ |
From: Jamie L. <lk...@ta...> - 2002-11-01 20:14:28
|
Charlie Krasic wrote: > I would like to see a new kind of nonblocking flag that implies the > use of epoll. So instead of giving O_NONBLOCK to fcntl(F_SETFL), you > give O_NONBLOCK_EPOLL. In addition to becoming non-blocking, the > socket is added to the epoll interest set. Furthermore, if the socket is > a "listener" socket, all connections accepted on the socket inherit > the non-blocking status and are added automatically to the same epoll > interest set. It's true that this can get silly though. I'd like to > do the same with other flags, like TCP_CORK. ... and close-on-exec. -- Jamie |
From: Mark M. <ma...@ma...> - 2002-11-01 20:20:35
|
On Fri, Nov 01, 2002 at 07:16:43PM +0000, Jamie Lokier wrote: > > Depends on the workload. Where I work, the http client I'm writing > > has to perform extremely well even on 1-byte files with HTTP 1.0. > > Minimizing system calls is surprisingly important - even > > a gettimeofday hurts. > For this sort of thing, I would like to see an option to automatically > set the non-blocking flag on accept(). To really squeeze the system > calls, you could also automatically epoll-register on accept(), and > for a super bonus automatically do the accept() at event delivery time. > But it's getting very silly at that point. Not really... isn't accept() automatically performed ahead of time anyways, as long as the listen queue isn't full? Another issue for the 'unified event notification model': How does epoll interact with signals, specifically the race condition between determining the timeout that should be passed to epoll_wait(), and epoll_wait() itself? (see pselect() for info) For example: it is very common for priority to be given to a fd callback before a signal callback, meaning that epoll_wait() would be called with timeout=0 if a received signal did not have its callback executed yet, or something greater, otherwise. I would like to see at least one of the following (suggestions made by other people) in the final version: 1) Userspace data pointer to allow more efficient userspace dispatching when epoll_wait() returns. (Something about scanning array structures for matching fd arguments rubs me the wrong way -- it shouldn't be necessary) 2) Reduced requirements to issue system calls such as read() when EAGAIN is the expected return value. The whole 'do a quick poll() or similar at registration time upon request' issue - for obscure cases that would require complex code, or code that cannot yet be agreed upon, this could temporarily mark events ready at registration without checking, with a goal of eliminating this behaviour one type of file at a time. 
Although the ability to wait on futex or timeout objects seems clever, I'm not sure that we are at a point where we know how they would be commonly used yet. Right now people need a poll() replacement for file descriptors. Timeouts can be handled by manipulating the argument to epoll_wait() and performing userspace analysis (same as poll()). Futex objects have not (to my knowledge) yet been used in great numbers at the same time (i.e. wait for 100 futexes to be obtained) probably because the routines necessary to perform this operation do not yet exist. It might be nice to fit this into epoll later, but it doesn't need to yet. mark -- ma...@mi... | Neighbourhood Coder | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/ |
From: Jamie L. <lk...@ta...> - 2002-10-31 15:43:30
|
ps. I thought I should explain what bothers me most about epoll at the moment. It's good at what it does, but it's so very limited in what it supports. I have a high performance server application in mind, that epoll is _almost_ perfect for but not quite. Davide, you like coroutines, so perhaps you will appreciate a web server that serves a mixture of dynamic and static content, using coroutines and user+kernel threading in a carefully balanced way. Dynamic content is cached, accurately (taking advantage of nanosecond mtimes if possible), yet served as fast as static pages (using a clever cache validation method), and is built from files (read using aio to improve throughput) and subrequests to other servers just like a proxy. Data is served zero-copy using sendfile and /dev/shm. A top quality server like that, optimised for performance, has to respond to these events:

- network accept()
- read/write/exception on sockets and pipes
- timers
- aio
- futexes
- dnotify events

See how epoll only helps with the first two? And this is the very application space that epoll could _almost_ be perfect for. Btw, it doesn't _have_ to be a web server. Enterprise scale Java runtimes, database servers, spider clients, network load generators, proxies, even humble X servers - also have very similar requirements. There are several scalable and fast event queuing mechanisms in the kernel now: rt-signals, aio and epoll, yet each of them is limited by only keeping track of a few kinds of possible event. Technically, it's possible to use them all together. If you want to react to all the kinds of events I listed above, you have to. But it's mighty ugly code to use them all at once, and it's certainly not the "lean and mean" event loop that everyone aspires to. By adding yet another mechanism without solving the general problem, epoll just makes the mighty ugly userspace more ugly. (But it's probably worth using - socket notification through rt-signals has its own problems). 
I would very much like to see a general solution to the problem of all different kinds of events being queued to userspace efficiently, through one mechanism ("to bind them all..."). Every piece of this puzzle has been written already, they're just not joined up very well. I'm giving this serious thought now, if anyone wants to offer input. -- Jamie ps. Alan, you mentioned something about futexes being suitable. Was that a vague notion, or do you have a clear idea in mind? (A nice way to collect events from a _set_ of futexes might be just the thing.) |
From: Alan C. <al...@lx...> - 2002-10-31 16:26:15
|
On Thu, 2002-10-31 at 15:41, Jamie Lokier wrote: > - network accept() > - read/write/exception on sockets and pipes > - timers > - aio > - futexes > - dnotify events > > See how epoll only helps with the first two? And this is the very > application space that epoll could _almost_ be perfect for. > > ps. Alan, you mentioned something about futexes being suitable. > Was that a vague notion, or do you have a clear idea in mind? > > (A nice way to collect events from a _set_ of futexes might be just the thing.) The futexes do all the high performance stuff you actually need. One way to do it is to do user space signal delivery setting futexes off, but that means user space switches and is just wrong. Setting a list of futexes instead of signal delivery in kernel space is fast. Letting the user pick which futexes get set allows you to do neat stuff like trees of wakeup without having to handle it kernel side. What is hard is multiple futex waits and livelock for that. I think it can be done properly but I've not sat down and designed it all out - I wonder what Rusty thinks. Alan |
From: Davide L. <da...@xm...> - 2002-10-31 20:18:43
|
On Thu, 31 Oct 2002, Jamie Lokier wrote: > ps. I thought I should explain what bothers me most about epoll at the > moment. It's good at what it does, but it's so very limited in what > it supports. > > I have a high performance server application in mind, that epoll is > _almost_ perfect for but not quite. > > Davide, you like coroutines, so perhaps you will appreciate a web > server that serves a mixture of dynamic and static content, using > coroutines and user+kernel threading in a carefully balanced way. > Dynamic content is cached, accurately (taking advantage of nanosecond > mtimes if possible), yet served as fast as static pages (using a > clever cache validation method), and is built from files (read using > aio to improve throughput) and subrequests to other servers just like > a proxy. Data is served zero-copy using sendfile and /dev/shm. > > A top quality server like that, optimised for performance, has to > respond to these events: > > - network accept() > - read/write/exception on sockets and pipes > - timers > - aio > - futexes > - dnotify events > > See how epoll only helps with the first two? And this is the very > application space that epoll could _almost_ be perfect for. > > Btw, it doesn't _have_ to be a web server. Enterprise scale Java > runtimes, database servers, spider clients, network load generators, > proxies, even humble X servers - also have very similar requirements. > > There are several scalable and fast event queuing mechanisms in the > kernel now: rt-signals, aio and epoll, yet each of them is limited by > only keeping track of a few kinds of possible event. > > Technically, it's possible to use them all together. If you want to > react to all the kinds of events I listed above, you have to. But > it's mighty ugly code to use them all at once, and it's certainly not > the "lean and mean" event loop that everyone aspires to. 
> > By adding yet another mechanism without solving the general problem, > epoll just makes the mighty ugly userspace more ugly. (But it's > probably worth using - socket notification through rt-signals has its > own problems). > > I would very much like to see a general solution to the problem of all > different kinds of events being queued to userspace efficiently, > through one mechanism ("to bind them all..."). Every piece of this puzzle > has been written already, they're just not joined up very well. > > I'm giving this serious thought now, if anyone wants to offer input. Jamie, the fact that epoll supports a limited number of "objects" was as-designed at that time. I see it as quite easy to extend it to support other objects. Futexes are a matter of one line of code:

/* Waiter either waiting in FUTEX_WAIT or poll(), or expecting signal */
static inline void tell_waiter(struct futex_q *q)
{
	wake_up_all(&q->waiters);
	if (q->filp) {
		send_sigio(&q->filp->f_owner, q->fd, POLL_IN);
+		file_notify_send(q->filp, ION_IN, POLLIN | POLLRDNORM);
	}
}

Timers, as long as you access them through a file* interface ( like futexes ), will become trivial too. Another line should be sufficient for dnotify :

void __inode_dir_notify(struct inode *inode, unsigned long event)
{
	struct dnotify_struct *dn;
	struct dnotify_struct **prev;
	struct fown_struct *fown;
	int changed = 0;

	write_lock(&dn_lock);
	prev = &inode->i_dnotify;
	while ((dn = *prev) != NULL) {
		if ((dn->dn_mask & event) == 0) {
			prev = &dn->dn_next;
			continue;
		}
		fown = &dn->dn_filp->f_owner;
		send_sigio(fown, dn->dn_fd, POLL_MSG);
+		file_notify_send(dn->dn_filp, ION_IN, POLLIN | POLLRDNORM | POLLMSG);
		if (dn->dn_mask & DN_MULTISHOT)
			prev = &dn->dn_next;
		else {
			*prev = dn->dn_next;
			changed = 1;
			kmem_cache_free(dn_cache, dn);
		}
	}
	if (changed)
		redo_inode_mask(inode);
	write_unlock(&dn_lock);
}

This is the result of a quick analysis, but I do not expect it to be much more difficult than that. - Davide |