From: Stephen D. <sd...@gm...> - 2012-10-11 12:02:51
|
On Wed, Oct 10, 2012 at 9:44 PM, Jeff Rogers <dv...@di...> wrote:
> It is possible to get into a situation where there are connections queued but no conn threads running to handle them, meaning nothing happens until a new connection comes in. When this happens the server will also not shut down cleanly. As far as I can figure, this can only happen if the connection queue is larger than connsperthread and the load is very bursty (i.e., a load test); all the existing conn threads can hit their cpt and exit, but a new conn thread only starts when a new connection is queued. I think the solution here is to limit maxconnections to no more than connsperthread. Doing so exposes a less severe problem where connections waiting in the driver thread don't get queued for some time; it's less of a problem because there is a timeout and the driver thread will typically wake up on a closing socket fairly soon, but it can still result in a simple request taking ~3s to complete. I don't know how to fix this latter problem.

I think this is racy because all conn threads block on a single condition variable. The driver thread and conn threads must cooperate to manage the whole life cycle, and the code that manages the state is spread around.

If instead all conn threads were in a queue, each with its own condition variable, the driver thread could have sole responsibility for choosing which conn thread to run by signalling it directly, probably in LIFO order rather than the current semi-round-robin order, which tends to cause all conn threads to expire at once. Conn threads would return to the front of the queue, unless wishing to expire, in which case they'd go on the back of the queue, and the driver would signal them when it was convenient to do so. Something like that... |
From: Gustaf N. <ne...@wu...> - 2012-10-25 09:32:07
|
I don't think that a major problem comes from the "racy" notification of queuing events to the connection threads. This has advantages (it makes the OS responsible, which handles this very efficiently, with fewer mutex requirements) and disadvantages (little control).

While the current architecture with the cond-broadcast is certainly responsible for the problem of threads dying simultaneously (the OS determines which thread sees the condition first, hence the round-robin behavior), a list of the linked connection threads does not help to determine how many threads are actually needed, how bursty thread creation should be, or how to handle short resource crunches (e.g. caused by locks, etc.). With a conn-thread-queue, the threads have to update this queue with their status information (being created, warming up, free, busy, will-die), which requires some overhead and more mutex locks on the driver. The thread-status handling currently happens automatically; a "busy" thread simply ignores the condition, etc.

On the good side, we would have more control over the threads. When a dying thread notifies the conn-thread-queue, one can control thread creation via this hook the same way as in situations where requests are queued. Another good aspect is that the thread-idle-timeout starts to make sense again on busy sites. Currently, thread reduction works via the counter, since unneeded threads die and won't be recreated unless the traffic requires it (which works in practice quite well). For busy sites, the thread-idle timeout is not needed this way.

Currently we have a one-way communication from the driver to the conn threads. With the conn-thread-list (or array), one would have two-way communication ... at least, that is how I understand this for now.

-gustaf neumann |
From: Stephen D. <sd...@gm...> - 2012-10-26 19:30:53
|
On Thu, Oct 25, 2012 at 10:31 AM, Gustaf Neumann <ne...@wu...> wrote:
> I don't think that a major problem comes from the "racy" notification of queuing events to the connection threads. This has advantages (make the OS responsible, which does this very efficiently, less mutex requirements) and disadvantages (little control).

I think the current code works something like this:

- driver thread acquires lock, puts connection on queue, broadcasts to conn threads, releases lock
- every conn thread waiting on the condition is woken up in some arbitrary but approximately round-robin order, and maybe some of the conn threads which aren't currently waiting pick up that message when they do wait, because for performance these things aren't strictly guaranteed (I may be remembering this wrong)
- n conn threads race to acquire the lock
- the one which gets the lock first takes the conn from the queue and releases the lock
- it runs the connection, acquires the lock again, puts the conn back on the queue, and releases the lock
- meanwhile the other woken conn threads acquire then release the lock with possibly nothing to do.

So for each request there can be up to 6 lock/unlock sequences by the driver and active conn thread, plus a lock/unlock by n other conn threads, all on one contended lock, plus the context-switching overhead, and this all happens in an undesirable order.

> By having a conn-thread-queue, the threads have to update this queue with their status information (being created, warming up, free, busy, will-die) which requires some overhead and more mutex locks on the driver.

I was thinking it could work something like this:

- driver acquires lock, takes first conn thread off queue, releases lock
- driver thread puts new socket in conn structure and then signals the cond of that one thread (no locking, I don't think)
- that conn thread wakes up, takes no locks, runs connection
- conn thread acquires driver lock, puts conn back on front of queue, releases lock

Four lock/unlock operations, two threads. |
From: Gustaf N. <ne...@wu...> - 2012-10-27 13:56:54
|
On 26.10.12 21:30, Stephen Deasey wrote:
> I was thinking it could work something like this:
> - driver acquires lock, takes first conn thread off queue, releases lock
> - driver thread puts new socket in conn structure and then signals the cond of that one thread (no locking, I don't think)
> - that conn thread wakes up, takes no locks, runs connection
> - conn thread acquires driver lock, puts conn back on front of queue, releases lock
> Four lock/unlock operations, two threads.

You are talking here about "lock" as if we had a single mutex; I guess you mean the connection-thread-queue lock in the first item. What happens with the request queue (the queue of parsed connections, currently realized via the array of connections)? Is the request-queue handling missing in the paragraph above, or are you considering getting rid of it entirely (which would raise further questions)?

This picture is drawn from the assumption that the request queue contains one request and that there is at least one connection thread waiting in the connection-thread-queue. But this is not always the case. Connection threads do not start in zero time; in the meanwhile many requests may already be queued. So, when a connection thread is ready to work, it should acquire a lock on the connection-thread-queue and add itself. Also, what happens when a connection thread terminates? Should it put itself in front of the connection-thread-queue as well, to signal the driver to create fresh connection threads? This could address the problem we discussed recently, where the request queue is still quite full and no new requests are coming in, but all connection threads are gone. If it adds itself as well, the driver thread could control the liveliness of the connection threads (keep at least minthreads running, etc.).

So I think the picture above oversimplifies too much, but I guess the overall message is certainly right: we can reduce the needless context switches this way. The question remains: how much is this really a problem? How much performance can be gained? Changing the notification structure (adding a connection-thread-queue and an extra condition) is a relatively small change, compared to a general redesign.

-gustaf neumann |
From: Gustaf N. <ne...@wu...> - 2013-06-22 13:59:39
|
Am 04.12.12 23:55, schrieb Gustaf Neumann:
> Am 04.12.12 20:06, schrieb Stephen Deasey:
>> - we should actually ship some code which searches for *.gz versions of static files
> this would mean keeping a .gz version and a non-.gz version in the file system for the cases where gzip is not an accepted encoding. Not sure I would like to manage these files and keep them in sync.... the fast-path cache could keep gzipped copies; invalidation is already there.

Dear all,

The updated version of NaviServer now supports the delivery of gzipped content via fastpath (a configuration sketch follows at the end of this message):

- added option "gzip_static" for "ns/fastpath" (default false): send the gzipped version of the file if available and the client accepts gzipped content. When a file path/foo.ext is requested, a file path/foo.ext.gz exists, and the timestamp of the gzipped file is equal to or newer than that of the source file, the gzipped file is used for delivery.

- added option "gzip_cmd" for "ns/fastpath" (default ""): command for zipping files in case the (static) gzipped version of the file is older than the source. The command is only used for re-gzipping outdated files; it does not actively compress files which were previously not compressed (this would be wasteful for e.g. large tmp files, there is no cleanup, etc.). If this parameter is not defined, outdated gzipped files are ignored and a warning is written to the error.log. Example setting: "/usr/bin/gzip -9".

- added a section on fastpath configuration to ns_return, describing the new and old configuration options (a description of the fastpath parameters was completely missing).

When gzip_cmd is configured, NaviServer takes care of updating the .gz file whenever the source file is updated; there is no burden for the admin. The gzip command is called via Tcl "exec", which can in turn be executed via nsproxy (see e.g. OpenACS). I took this approach over an in-memory variant to avoid memory bloat in case huge files have to be compressed. Note that the compression overhead is typically a one-time operation: the website maintainer defines which files should be sent gzipped by compressing them once.

Gzipped content delivery now happens for
- ns_returnfile
- ns_respond -file
- static files from the pagedir
when the corresponding .gz files are available.

all the best
-gustaf neumann |
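[A minimal configuration sketch based on the two parameters described above. The parameter names come from the message; the server-specific section path ("server1") and the values are assumptions of this example.]

    # Enable static gzip delivery via fastpath (sketch).
    ns_section "ns/server/server1/fastpath"
    ns_param   gzip_static  true                ;# serve foo.ext.gz when present and not outdated
    ns_param   gzip_cmd     "/usr/bin/gzip -9"  ;# used only to re-gzip outdated .gz files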
From: Jeff R. <dv...@di...> - 2013-07-04 03:13:43
|
Gustaf Neumann wrote:
> When the gzip_cmd is configured, NaviServer keeps track of updating the .gz file for the cases the source file is updated. There is no burden for the admin. The gzip command is called via Tcl "exec", which can in turn be executed via nsproxy (see e.g. OpenACS).

I think there's a potential security hole here. I didn't come up with a proper exploit, but if a user can get control of a filename (e.g., if there is an ability to upload files), then an arbitrary string could get passed to the exec command, including but not limited to [] (which would let Tcl do command expansion) or spaces (which could cause the filename to be interpreted as multiple words and hijack the exec behavior).

Using Tcl_DStringAppendElement instead of Tcl_DStringAppend should prevent this, as it will force the filename to be a proper list element.

Alternately, it would be more flexible to change the definition of the zipCmd to be a Tcl command that is passed the filename and zipfile name, e.g., "gzip_cmd file file.gz", with the Tcl definition of gzip_cmd choosing how to handle it, whether by exec or compressing in-process (e.g., with 'zlib compress'), or choosing based on the file size.

-J |
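[The class of problem can be illustrated at the Tcl level; this is a hypothetical sketch with a made-up filename, not the actual code path. The real fix Jeff proposes is on the C side, where Tcl_DStringAppendElement forces the filename to be a single list element.]

    set gzipcmd "/usr/bin/gzip -9"
    set file    {foo.ext;rm -rf x}      ;# attacker-influenced name (example)

    # Unsafe: splicing the raw filename into a script string and evaluating it
    # re-parses the name, so spaces, semicolons or [brackets] become extra
    # words or command substitution:
    #   eval "exec $gzipcmd $file"

    # Safer: expand the command as a list and pass the filename as one word;
    # exec then sees exactly one argument ("--" guards against names starting
    # with "-", supported by GNU gzip).
    if {[catch {exec {*}$gzipcmd -- $file} err]} {
        puts stderr "gzip failed: $err"
    }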
From: Gustaf N. <ne...@wu...> - 2013-07-04 09:39:20
|
Hi Jeff,

many thanks for the feedback. The logic is as follows:

- when gzip_static is activated, fastpath checks for the existence of a .gz file and uses it if it is not outdated.
- if gzip_static is activated and a gzip file exists but is outdated, then gzip_cmd comes into play to update it, but only when it is configured (by default it is not configured).

So an attack vector is more complex than you sketched, but I guess possible when gzip_cmd is activated. In general, you are right: using a Tcl proc is better, since it makes it easy to change the behavior without touching C. I have added a boolean parameter "gzip_refresh" to make the site admin's intention of refreshing gzip files on the fly explicit. A new Tcl command "ns_gzipfile" is now used for gzipping (this command requires Tcl 8.5 to work).

https://bitbucket.org/naviserver/naviserver/commits/f7dca733625553ab802c93d35d471524e5a13e3e

all the best
-gustaf neumann |
From: Jeff R. <dv...@di...> - 2012-10-25 15:28:27
|
I started working on some related ideas on a fork I created (naviserver-queues). The main thing I'm trying to improve is how the state-managing code is spread out across several functions and different threads. My approach is to have a separate monitor thread for each server that checks all the threads in each pool to see that there are enough threads running, that threads die when they have been around too long (serviced too many requests), and that threads die when no longer needed. The driver thread still just queues the requests, but now there's an extra layer of control, and yes, overhead.

The question of how many threads are needed is an interesting one: is it better to create threads quickly in response to traffic, or to create them more slowly in case the traffic is very bursty? The answer is, of course, it depends. So I'm assuming that the available processing power - the number of threads - should correlate to how busy the server is. A server that is 50% busy should have 50% of its full capacity working. I'm using the wait queue length as a proxy for busyness: if the wait queue is 50% full, then 50% of maxthreads should be running. Because this seems to unnecessarily underpower the server, I added an adjustable correction factor to scale it, so that you could tune for eager thread creation (close to maxthreads when the server is only 20% busy) or wait until the server is 80% busy before spawning more than a few threads. On the idle end it works similarly: if you tuned for eager thread creation, then the threads wait longer to idle out.

I expect that this approach would lead to a sort of self-balancing: if the queue gets bigger, it starts getting serviced faster and stops growing, while if it shrinks, it gets serviced slower and stops shrinking. There's room for experimentation here on exactly how to tune it; in my revision this logic is all in one place, so that's simple (a sketch of the scaling rule follows below).

My initial testing shows that the server handles the same throughput (total req/s about the same) and is a bit more equitable (smaller difference between slowest and fastest request), but slightly less responsive (which is expected, since requests inherently spend longer in the wait queue). I'm still cleaning it up and it's definitely not ready for prime time, but I'd be interested to hear what others think.

-J |
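[A rough Tcl sketch of the proportional scaling rule with an adjustable correction factor described above. The exponent form of the factor, the proc name, and its arguments are assumptions of this sketch, not what the naviserver-queues fork actually implements.]

    proc target_threads {queued queuesize maxthreads {gamma 1.0}} {
        # Fraction of the wait queue currently in use, as a proxy for busyness.
        set fullness [expr {double($queued) / double($queuesize)}]
        # gamma < 1.0 tunes for eager thread creation (near maxthreads while the
        # queue is still mostly empty); gamma > 1.0 delays creation until the
        # queue is nearly full; gamma == 1.0 is the plain proportional rule.
        set target [expr {int(ceil($maxthreads * pow($fullness, $gamma)))}]
        return [expr {max(1, min($maxthreads, $target))}]
    }

    # Example: queue 50% full, maxthreads 20
    #   target_threads 50 100 20        -> 10 threads (proportional)
    #   target_threads 50 100 20 0.5    -> 15 threads (eager)
    #   target_threads 50 100 20 2.0    ->  5 threads (lazy)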
From: Stephen D. <sd...@gm...> - 2012-10-26 20:07:19
|
Interesting, but I wonder if we're not thinking this through correctly. My suggestion, and yours here, and Gustaf's recent work are all aimed at refining the model as it currently is, but I wonder if we're even attempting to do the right thing?

> So I'm assuming that the available processing power - the number of threads - should correlate to how busy the server is. A server that is 50% busy should have 50% of its full capacity working.

But what is busy, CPU? There needs to be an appropriate max number of threads to handle the max expected load, considering the capabilities of the machine. Too many and the machine will run slower. But why kill them when we're no longer busy?

- naviserver conn threads use a relatively large amount of memory because there tends to be one or more tcl interps associated with each one
- killing threads kills interps, which frees memory

But this is only useful if you can use the memory more profitably somewhere else, and I'm not sure you can. It is incoming load which drives conn thread creation, and therefore memory usage, not availability of memory. So if you kill off some conn threads when they're not needed, freeing up some memory for some other system, how do you get the memory back when you create conn threads again? There needs to be some higher mechanism which has a global view of, say, your database and web server requirements, and can balance the memory needs between them.

I think it might be better to drop min/max conn threads and just have n conn threads, always:

- simpler code
- predictable memory footprint
- bursty loads aren't delayed waiting for conn threads/interps to be created
- interps can be fully pre-warmed without delaying requests
- could back-port aolserver's ns_pools command to dynamically set the nconnthreads setting

With ns_pools you could do something like use a scheduled proc to set nconnthreads down to 10 from 20 between 3-5am when your database is taking a hefty dump (a sketch follows below).

Thread pools are used throughout the server: multiple pools of conn threads, driver spool threads, scheduled proc threads, job threads, etc., so one clean way to tackle this might be to create a new nsd/pools.c which implements a very simple generic thread pool which has n threads, fifo ordering for requests, a tcl interface for dynamically setting the number of threads, and thread recycling after n requests. Then try to implement conn threads in terms of it.

btw. an idea for pre-warming conn thread interps: generate a synthetic request to /_ns/pool/foo/warmup (or whatever) when the thread is created, before it is added to the queue. This would cause the tcl source code to be byte compiled, and this could be controlled precisely by registering a proc for that path. |
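[A minimal sketch of the scheduled resizing and warm-up ideas above. ns_schedule_daily and ns_register_proc are existing commands; the ns_pools subcommand/option syntax is hypothetical (the command would need to be back-ported from AOLserver, as suggested), and the pool name, times, and warmup URL are assumptions of this example.]

    # Shrink the default conn-thread pool at 03:00 and restore it at 05:00
    # (hypothetical ns_pools syntax).
    ns_schedule_daily 3 0 {ns_pools set default -maxthreads 10}
    ns_schedule_daily 5 0 {ns_pools set default -maxthreads 20}

    # Pre-warm handler: a synthetic GET /_ns/pool/default/warmup issued when a
    # conn thread starts would byte-compile whatever code this proc touches.
    proc pool_warmup {} {
        ns_return 200 text/plain warmed
    }
    ns_register_proc GET /_ns/pool/default/warmup pool_warmup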
From: Andrew P. <at...@pi...> - 2012-10-26 22:28:55
|
On Fri, Oct 26, 2012 at 08:30:26PM +0100, Stephen Deasey wrote:
> I was thinking it could work something like this:
> - driver acquires lock, takes first conn thread off queue, releases lock

What if there are no conn threads waiting in the queue?

-- Andrew Piskorski <at...@pi...> |
From: Jeff R. <dv...@di...> - 2012-10-26 22:41:49
|
Stephen Deasey wrote:
> Interesting, but I wonder if we're not thinking this through correctly. My suggestion, and yours here, and Gustaf's recent work are all aimed at refining the model as it currently is, but I wonder if we're even attempting to do the right thing?

Do we even know what the right thing is? It could be any of
- maximize performance at any cost
- minimize resource usage
- adapt to dynamically changing workload
- minimize admin workload
And so forth. I think there is no one-size-fits-all answer, but it should be possible, and hopefully easy, to get something that fits closely enough.

> But what is busy, CPU? There needs to be an appropriate max number of threads to handle the max expected load, considering the capabilities of the machine. Too many and the machine will run slower. But why kill them when we're no longer busy?

I don't know how to precisely define, let alone measure, busy, which is why I'm picking something that is readily measurable. A thread is busy in the sense that it is unavailable to process new requests, but there are different reasons why it might be unavailable - either it's cpu bound, or it's blocking on something like a database call or i/o. Different reasons for being busy might suggest different responses.

> - naviserver conn threads use a relatively large amount of memory because there tends to be one or more tcl interps associated with each one

Different requests could have different memory/resource requirements, which is a really nice thing about pools. I'm hypothesizing that there are several different categories that requests fall into, based on the server resources needed to serve them and the time needed to complete them. Server resources (memory) are either 'small' for requests that do not need a tcl interp (although tcl filters could tend to make this a nonexistent set), or 'big' for those that do. Time is either slow or fast, by some arbitrary measure.

So a small/fast pool could be set up to serve static resources, a big/fast pool for non-database scripts, and a big/slow pool for database stuff. I'm not sure what could be small/slow, maybe a c-coded proxy server or very large static files being delivered over slow connections. The small/fast pool would only need a small number of threads with a high maxconnsperthread, while the big/slow pool might have many threads, as most of those will be blocking on database access at any given time. The important question in all of this is whether a complex segmented setup like this works better in practice than a single large-enough pool of equal threads. To which I don't have a good answer. (A configuration sketch of such a split follows below.)

> - killing threads kills interps which frees memory
> But this is only useful if you can use the memory more profitably somewhere else, and I'm not sure you can.

Not only that, but memory isn't always released back to the system when free()d. (vtmalloc is supposed to be able to do that, but I haven't had much success with it so far.) So freeing memory by shutting down threads won't necessarily make it available to your database. However, memory that is not used by a process could be swapped out, making more physical RAM available for other processes. Having 20 threads all used a little bit could keep them all in memory, while having just 1 used a lot would keep a smaller working set. This is a much bigger concern for low-resource systems. Big systems nowadays have more physical memory than you can shake a stick at, and swapping seems almost quaint.

> I think it might be better to drop min/max conn threads and just have n conn threads, always:

I've heard this recommendation before, in the context of tuning apache for high workloads - set maxservers=minservers=startservers. I think it would make tuning easier for a lot of people if there was a basic "systemsize" parameter (small/medium/large) that set various other parameters to preset values. As to what those values should be, that would take some thinking and experimentation.

> Thread pools are used throughout the server: multiple pools of conn threads, driver spool threads, scheduled proc threads, job threads, etc. so one clean way to tackle this might be to create a new nsd/pools.c which implements a very simple generic thread pool which has n threads, fifo ordering for requests, a tcl interface for dynamically setting the number of threads, and thread recycling after n requests. Then try to implement conn threads in terms of it.

I was thinking the exact same thing.

Sorry for the rambling/scattered thoughts, having a long commute does that :/

-J |
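[A hypothetical configuration sketch of the small/fast vs. big/slow split described above. The pool names, parameter values, and in particular the "map" mechanism for routing requests to pools are assumptions of this example; how pools are mapped depends on the NaviServer version in use.]

    ns_section "ns/server/server1/pools"
    ns_param   static "small/fast: static resources, no Tcl needed"
    ns_param   slow   "big/slow: database-bound scripts"

    ns_section "ns/server/server1/pool/static"
    ns_param   minthreads        2
    ns_param   maxthreads        2
    ns_param   connsperthread 10000
    ns_param   map "GET /*.css"
    ns_param   map "GET /*.js"
    ns_param   map "GET /images/*"

    ns_section "ns/server/server1/pool/slow"
    ns_param   minthreads        5
    ns_param   maxthreads       30
    ns_param   connsperthread 1000
    ns_param   map "GET /reports/*"

    # Requests not mapped to any pool are handled by the default pool.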
From: Stephen D. <sd...@gm...> - 2012-10-27 12:57:17
|
On Fri, Oct 26, 2012 at 11:41 PM, Jeff Rogers <dv...@di...> wrote:
> Do we even know what the right thing is? It could be any of
> - ...
> - minimize resource usage
> - adapt to dynamically changing workload
> - ...

Right, but you can't actually use the memory that temporarily running fewer threads frees up, because the threads may be restarted at any moment, so the goal of adapting and minimizing is not being met. Memory is being wasted, not recycled. (Maybe. That's what I'm suggesting...)

> Not only that, but memory isn't always released back to the system when free()d. (vtmalloc is supposed to be able to do that, but I haven't had much success with it so far.) So freeing memory by shutting down threads won't necessarily make that available to your database.

I think glibc these days uses MADV_DONTNEED on ranges within mmap'd areas it uses for malloc. It doesn't reduce the address-space usage, so this only shows up as lower RSS under top.

> Server resources (memory) are either 'small' for requests that do not need a tcl interp (although tcl filters could tend to make this a nonexistent set), or 'big' for those that do. Time is either slow or fast, by some arbitrary measure.
> So a small/fast pool could be set up to serve static resources, a big/fast pool for non-database scripts, and a big/slow pool for database stuff.

Naviserver has some stuff which AOLserver doesn't to help with this partitioning. Gustaf gave an example the other day where a server gets a small burst of traffic, a couple of page requests, but the browser simultaneously requests all the css and javascript etc., which causes a bunch of new threads to be created and a stall as they all have their interps allocated. You can create a non-tcl pool as well as the default pool and then use some tricks to force certain types of requests not to use Tcl (see the sketch after this message):

- ACS registers a default handler for /*, but naviserver exposes ns_register_fastpath, with which you can re-register the pure C fastpath handler for /*.css etc.

- ACS has an elaborate search path for files, so the above is not enough. But naviserver provides the url2file callback interface. A pure C version of the algorithm which maps a url path to a file path could be coded, and then the fastpath code would correctly find package-specific static assets.

- There are auth filters and so on registered for /*, but again they aren't needed for /*.css. You can use ns_shortcut_filter to push a null filter to the front of the filter queue which simply returns OK and prevents the rest from running.

> The small/fast pool would only need a small number of threads with a high maxconnsperthread, while the big/slow pool might have many threads, as most of those will be blocking on database access at any given time.

This is sort of a recreation of the memory situation I was suggesting may not be working. Balancing memory between naviserver and postgres is like balancing the threads (memory) in two pools. The max number of threads across all pools can't be so high that it overwhelms the server. Within that constraint you have to balance the threads between the pools. If you have two pools each with 10 threads, and one pool is busy but the other is idle, then the server is not running to capacity, but you may have to reject requests. If you increase the threads in the busy pool, the other pool may also become busy and now the server is overwhelmed.

Tcl-using conn threads are often so memory-intensive that it seems like it would always be a win to have two conn thread pools, for Tcl and non-Tcl requests. To partition further there's ns_limits, but that's not hooked up and needs more work. |
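[A sketch of the tricks mentioned above, using the ns_register_fastpath and ns_shortcut_filter commands named in the message; the URL patterns and the preauth stage are assumptions of this example.]

    # Serve common static asset types via the C fastpath handler instead of
    # the ACS catch-all /* handler (patterns here are examples).
    foreach pattern {/*.css /*.js /*.png /*.gif} {
        ns_register_fastpath GET  $pattern
        ns_register_fastpath HEAD $pattern
    }

    # Short-circuit the filter chain for those requests: ns_shortcut_filter
    # pushes a null filter that returns OK, so the /* auth filters never run.
    foreach pattern {/*.css /*.js /*.png /*.gif} {
        ns_shortcut_filter preauth GET $pattern
    }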
From: Jeff R. <dv...@di...> - 2012-10-26 22:44:31
|
Andrew Piskorski wrote:
> On Fri, Oct 26, 2012 at 08:30:26PM +0100, Stephen Deasey wrote:
>> I was thinking it could work something like this:
>> - driver acquires lock, takes first conn thread off queue, releases lock
> What if there are no conn threads waiting in the queue?

Same as currently, I'd think: the driver holds on to them as waiting sockets. I think the handling of this is a bit less efficient than putting them on the conn queue tho, as it creates more work for the driver to do on every spin, and it needs to get woken up once threads are available.

-J |
From: Stephen D. <sd...@gm...> - 2012-10-27 11:39:07
|
On Fri, Oct 26, 2012 at 11:44 PM, Jeff Rogers <dv...@di...> wrote:
> Same as currently I'd think: the driver holds on to them as waiting sockets. I think the handling of this is a bit less efficient than putting them on the conn queue tho, as it creates more work for the driver to do on every spin and it needs to get woken up once threads are available.

- driver takes the lock, sees that there are no threads in the thread queue, puts conn on the back of the conn queue, does not signal anything, releases the lock

- conn thread completes a request, takes the driver lock.
  -- If the conn queue is empty, it puts itself on the front of the thread queue and releases the lock.
  -- Otherwise it takes the next conn and releases the lock. |
From: Gustaf N. <ne...@wu...> - 2012-10-28 02:09:10
|
On 27.10.12 15:56, Gustaf Neumann wrote:
> Changing the notification structure (adding a connection-thread-queue and an extra condition) is a relatively small change, compared to a general redesign.

I've just implemented a lightweight version of the above (just a few lines of code) by extending the connThread Arg structure. I have not handled the spooling and server-stopping cases (e.g. NsWaitServer), but this looks promising and can still be optimized further; it already passes the regression test. When maxthreads connThread Arg structures are allocated per pool at start time, one can iterate over these in the stopping cases to compensate for the needed Ns_CondBroadcasts. No additional thread is needed.

-gustaf neumann |
From: Gustaf N. <ne...@wu...> - 2012-10-29 12:41:23
|
A version of this is in the following fork:

https://bitbucket.org/gustafn/naviserver-connthreadqueue/changesets

So far, the contention on the pool mutex is quite high, but I think it can be improved. Currently the pool mutex is used primarily for conn-thread life-cycle management, and it is needed by the main/driver/spooler threads as well as by the connection threads to update the idle/running/... counters needed for controlling thread creation etc. Differentiating these mutexes should help.

I have not addressed the termination signaling, but that's rather simple.

-gustaf neumann

On 28.10.12 03:08, Gustaf Neumann wrote:
> i've just implemented lightweight version of the above (just a few lines of code) by extending the connThread Arg structure; .... |
From: Gustaf N. <ne...@wu...> - 2012-11-01 19:17:10
|
Dear all,

There is now a version on bitbucket which works quite nicely and stably, as far as I can tell. I have split up the rather coarse lock over all pools and introduced finer locks for the waiting queue (wqueue) and the thread queue (tqueue) per pool. The changes lead to significantly finer lock granularity and improve scalability.

I have tested this new version with a synthetic load of 120 requests per second, some slower requests and some faster ones, and it appears to be pretty stable. This load keeps about 20 connection threads quite busy on my home machine. The contention on the new locks is very low: in this test we saw 12 busy locks out of 217,000 locks on the waiting queue, and 9 busy locks out of 83,000 locks on the thread queue. These measures are much better than in current NaviServer, which on the same test has 248,000 locks on the queue with 190 busy ones. The total waiting time for locks is reduced by a factor of 10. One has to add that it was not so bad before either. The benefit will be larger when multiple pools are used.

Finally, I think the code is clearer than before, where the lock duration was quite tricky to determine.

opinions?
-gustaf neumann

PS: For the changes, see:
https://bitbucket.org/gustafn/naviserver-connthreadqueue/changesets

PS2: I have not addressed the server exit signaling yet. |
From: Gustaf N. <ne...@wu...> - 2012-11-07 01:55:44
|
Some update: after some more testing with the new code, I still think the version is promising, but it needs a few tweaks. I have started to address thread creation. (A configuration sketch using the new parameters follows at the end of this message.)

To sum up the thread-creation behavior/configuration of naviserver-tip:

- minthreads (try to keep at least minthreads threads idle)
- spread (fight against thread mass extinction due to round robin)
- threadtimeout (useless due to round robin)
- connsperthread (the only parameter effectively controlling the lifespan of a conn thread)
- maxconnections (controls the maximum number of connections in the waiting queue, including running threads)
- concurrentcreatethreshold (percentage of the waiting queue that must be full before threads are created concurrently)

Due to the policy of keeping at least minthreads threads idle, threads are preallocated when the load is high, and the number of threads never falls under minthreads by construction. Threads stop mostly due to connsperthread.

NaviServer with thread queue (fork):

- minthreads (try to keep at least minthreads threads idle)
- threadtimeout (works effectively, default 120 secs)
- connsperthread (as before, just not varied via spread)
- maxconnections (as before; maybe use "queuesize" instead)
- lowwatermark (new)
- highwatermark (was concurrentcreatethreshold)

The parameter "spread" is already deleted, since the enqueueing takes care of a certain distribution, at least when several threads are created. Threads are now often deleted before reaching connsperthread, due to the timeout. Experiments show furthermore that the rather aggressive preallocation policy of keeping minthreads threads idle now causes many more thread destroy and create operations than before. With OpenACS, thread creation is compute-intensive (about 1 sec).

In the experimental version, connections are only queued when no connection thread is available (the tip version places every connection into the queue). Queueing happens with "bulky" requests, when e.g. a view causes a bunch (on average 5, often 10+, sometimes 50+) of requests for embedded resources (style files, javascript, images). It seems that permitting a few queued requests is often a good idea, since the connection threads typically pick these up very quickly.

To make the aggressiveness of the thread-creation policy better configurable, the experimental version uses for this purpose solely the number of queued requests, based on two parameters:

- lowwatermark (if the actual queue size is below this value, don't try to create threads; default 5%)
- highwatermark (if the actual queue size is above this value, allow parallel thread creates; default 80%)

To increase the aggressiveness, one could set lowwatermark to e.g. 0, causing thread creates whenever a connection is queued. Increasing the lowwatermark reduces the willingness to create new threads. The highwatermark might be useful for benchmark situations, where the queue fills up quickly. The default values seem to work quite well; they are currently used on http://next-scripting.org. However, we still need some more experiments on different sites to get a better understanding.

hmm, final comment: for the regression test, I had to add the policy of creating threads when all connection threads are busy. The config file of the regression test uses connsperthread 0 (which is the default, but not very good as such), causing every connection thread to exit after a single connection. So, when a request comes in while one thread is busy but nothing is queued, there would seem to be no need to create a new thread. However, when that conn thread exits, the single queued request would not be processed.

So, much more testing is needed.
-gustaf neumann |
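[A hypothetical configuration sketch using the parameters discussed above; the section path, server name, and values are illustrative assumptions, not recommended settings.]

    ns_section "ns/server/server1"
    ns_param   minthreads      2
    ns_param   maxthreads     10
    ns_param   connsperthread 1000   ;# recycle a conn thread after 1000 connections
    ns_param   threadtimeout  120    ;# exit an idle conn thread after 120 seconds
    ns_param   maxconnections 100    ;# size of the waiting queue
    ns_param   lowwatermark   5      ;# queue fill % below which no new threads are created
    ns_param   highwatermark  80     ;# queue fill % above which threads may be created in parallel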
From: Gustaf N. <ne...@wu...> - 2012-11-13 11:18:47
|
Dear all,

again some update: the mechanism sketched in my previous mail now works in the regression test as well. There is now a backchannel in place that lets conn threads notify the driver to check the liveliness of the server. This backchannel also makes the timeout-based liveliness checking obsolete. By using the lowwatermark parameter to control thread creation, the resource consumption went down significantly without sacrificing speed for this setup.

Here is some data from next-scripting.org, which is a rather idle site with real-world traffic (including bots etc.). The server has minthreads = 2, is running 2 drivers (nssock + nsssl) and uses a writer thread.

before (creating threads when idle == 0, running the server for 2 days):
  requests 10468, connthreads 267, total cputime 00:10:32

new (creating threads when queue >= 5, running the server for 2 days):
  requests 10104, connthreads 27, total cputime 00:06:14

One can see that the number of create operations for connection threads went down by a factor of 10 (from 267 to 27), and that the cpu consumption was reduced by about 40% (thread initialization costs 0.64 secs in this configuration). One can get a behavior similar to idle == 0 by setting the low water mark to 0.

The shutdown mechanism is now adjusted to the new infrastructure (connection threads have their own condition variable, so one cannot use the old broadcast to all conn threads anymore).

-gustaf neumann |
From: Stephen D. <sd...@gm...> - 2012-11-13 14:03:10
|
On Tue, Nov 13, 2012 at 11:18 AM, Gustaf Neumann <ne...@wu...> wrote:
> minthreads = 2
> creating threads when idle == 0: 10468 requests, connthreads 267, total cputime 00:10:32
> creating threads when queue >= 5: requests 10104, connthreads 27, total cputime 00:06:14

What if you set minthreads == maxthreads? |
From: Gustaf N. <ne...@wu...> - 2012-11-14 08:52:04
|
On 13.11.12 15:02, Stephen Deasey wrote:
> What if you set minthreads == maxthreads?

The number of thread create operations will go further down. When the server is already running at minthreads, the connection-thread timeout is ignored (otherwise there would be a high number of thread create operations just after the timeout expires, to keep minthreads connection threads running). With connsperthread == 1000, there will be about 10 thread create operations for 10000 requests (not counting the 2 initial create operations during startup for minthreads == 2). So the cpu consumption will be lower, but the server will not scale when the request frequency requires more connection threads. Furthermore, most likely more requests will be put into the queue instead of being served immediately.

If we assume that with minthreads == maxthreads == 2 there won't be more than, say, 20 requests queued, a similar effect could be achieved by allowing additional thread creation once more than 20 requests are in the queue. Or, even more conservatively, allowing thread creation only when the request queue is completely full (setting the low water mark to 100%) would still be better than minthreads == maxthreads, since the server will at least start to create additional threads in this rather hopeless situation, whereas with minthreads == maxthreads it won't. |
From: Gustaf N. <ne...@wu...> - 2012-11-18 13:22:12
|
On 14.11.12 09:51, Gustaf Neumann wrote: > On 13.11.12 15:02, Stephen Deasey wrote: >> On Tue, Nov 13, 2012 at 11:18 AM, Gustaf Neumann <ne...@wu...> wrote: >>> minthreads = 2 >>> >>> creating threads, when idle == 0 >>> 10468 requests, connthreads 267 >>> total cputime 00:10:32 >>> >>> creating threads, when queue >= 5 >>> requests 10104 connthreads 27 >>> total cputime 00:06:14 >> What if you set minthreads == maxthreads? > The number of thread create operations will go further down. Here are some actual figures with a comparable number of requests:

with minthreads==maxthreads==2
requests 10182 queued 2695 connthreads 11 cpu 00:05:27 rss 415

Below are the previous values, completed by the number of queuing operations and the rss size in MB:

with minthreads=2, create when queue >= 2
requests 10104 queued 1584 connthreads 27 cpu 00:06:14 rss 466

As anticipated, thread creations and cpu consumption went down, but the number of queued requests (requests that could not be executed immediately) increased significantly. Maybe the most significant benefit of a low maxthreads value is the reduced memory consumption. On this machine we are using plain Tcl with its "zippy malloc", which does not release memory (once allocated to its pool) back to the OS. So, the measured memsize depends on the maximum number of threads with Tcl interps, especially with large blueprints (as in the case of OpenACS). This situation can be improved with e.g. jemalloc (which is what we are using in production, and which requires a modified Tcl), but after about 2 or 3 days of running a server the rss sizes are very similar (most likely due to fragmentation). -gustaf neumann > When running already at minthreads, the connection thread > timeout is ignored (otherwise there would be a high number > of thread create operations just after the timeout expires > to ensure minthreads running connection threads). With > connsperthread == 1000, there will be about 10 thread create > operations for 10000 requests (not counting the 2 initial > create operation during startup for minthreads == 2). So, > the cpu consumption will be lower, but the server would not > scale, when the requests frequency would require more > connection threads. Furthermore, there will be most likely > more requests put into the queue instead of being served > immediately. > > When we assume, that with minthreads == maxthreads == 2 > there won't be more than say 20 requests queued, a similar > effect could be achieved by allowing additional thread > creations for more than 20 requests in the queue. Or even > more conservative, allowing thread creations only when the > request queue is completely full (setting the low water mark > to 100%) would as well be better than minthreads == > maxthreads, since the server will at least start to create > additional threads in this rather hopeless situation, where > with minthreads == maxthreads, it won't. |
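How the rss numbers above were obtained is not shown; one simple way to sample the resident set size of the running server from Tcl would be a sketch like the following (Linux-only, reads /proc, not a NaviServer API):

  proc rss_mb {} {
      # read VmRSS of the current process from the proc filesystem
      set f [open /proc/[pid]/status]
      set status [read $f]
      close $f
      if {![regexp {VmRSS:\s+(\d+) kB} $status -> kb]} {
          return -1   ;# not available on this platform
      }
      return [expr {$kb / 1024}]
  }
  ns_log notice "current rss: [rss_mb] MB"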
From: Stephen D. <sd...@gm...> - 2012-11-18 19:35:19
|
On Sun, Nov 18, 2012 at 1:22 PM, Gustaf Neumann <ne...@wu...> wrote: > On 14.11.12 09:51, Gustaf Neumann wrote: > > On 13.11.12 15:02, Stephen Deasey wrote: > > On Tue, Nov 13, 2012 at 11:18 AM, Gustaf Neumann <ne...@wu...> wrote: > > minthreads = 2 > > creating threads, when idle == 0 > 10468 requests, connthreads 267 > total cputime 00:10:32 > > creating threads, when queue >= 5 > requests 10104 connthreads 27 > total cputime 00:06:14 > > What if you set minthreads == maxthreads? > > The number of thread create operations will go further down. > > Here are some actual figures with a comparable number of requests: > > with minthreads==maxthreads==2 > requests 10182 queued 2695 connthreads 11 cpu 00:05:27 rss 415 > > below are the previous values, competed by the number of queuing operations > and the rss size in MV > > with minthreads=2, create when queue >= 2 > requests 10104 queued 1584 connthreads 27 cpu 00:06:14 rss 466 > > as anticipated, thread creations and cpu consumption went down, but the > number of queued requests (requests that could not be executed immediately) > increased significantly. I was thinking of the opposite: make min/max threads equal by increasing min threads. Requests would never stall in the queue, unlike the experiment you ran with max threads reduced to min threads. But there's another benefit: unlike the dynamic scenario requests would also never stall in the queue when a new thread had to be started when min < max threads. What is the down side to increasing min threads up to max threads? > Maybe the most significant benefit of a low maxthreads value is the reduced > memory consumption. On this machine we are using plain Tcl with its "zippy > malloc", which does not release memory (once allocated to its pool) back to > the OS. So, the measured memsize depends on the max number of threads with > tcl interps, especially with large blueprints (as in the case of OpenACS). Right: the max number of threads *ever*, not just currently. So by killing threads you don't reduce memory usage, but you do increase latency for some requests which have to wait for a thread+interp to be created. Is it convenient to measure latency distribution (not just average)? I guess not: we record conn.startTime when a connection is taken out of the queue and passed to a conn thread, but we don't record the time when a socket was accepted. Actually, managing request latency is another area we don't handle so well. You can influence it by adjusting the OS listen socket accept queue length, you can adjust the length of the naviserver queue, and with the proposed change here you can change how aggressive new threads are created to process requests in the queue. But queue-depth is a roundabout way of specifying milliseconds of latency. And not just round-about but inherently imprecise as different URLs are going to require different amounts of time to complete, and which URLs are requested is a function of current traffic. If instead of queue size you could specify a target latency then we could maybe do smarter things with the queue, such as pull requests off the back of the queue which have been waiting longer than the target latency, making room for fresh requests on the front of the queue. |
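To illustrate the target-latency idea, here is a hedged sketch in Tcl pseudocode; the real queue lives in the driver's C code, and all names here (the queue list, targetLatencyMs) are invented for the example. New requests are prepended to the front of the list, the oldest entries sit at the back, and anything that has already waited longer than the target is discarded before a request is handed to a connection thread:

  # queue is a list of {socketId acceptTimeMs} pairs, freshest first
  proc dequeueWithLatencyTarget {queueVar targetLatencyMs} {
      upvar 1 $queueVar queue
      set now [clock milliseconds]
      # drop requests from the back that have exceeded the latency target
      while {[llength $queue] > 0} {
          lassign [lindex $queue end] sock accepted
          if {$now - $accepted <= $targetLatencyMs} {
              break
          }
          ns_log notice "dropping $sock after [expr {$now - $accepted}] ms in the queue"
          set queue [lrange $queue 0 end-1]
      }
      if {[llength $queue] == 0} {
          return ""
      }
      # serve the oldest request that is still within its latency budget
      set next [lindex $queue end]
      set queue [lrange $queue 0 end-1]
      return $next
  }

Whether a request that misses its budget should be rejected outright or still be served late is of course a policy question in itself.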
From: Gustaf N. <ne...@wu...> - 2012-11-19 11:30:28
|
On 18.11.12 20:34, Stephen Deasey wrote: > On Sun, Nov 18, 2012 at 1:22 PM, Gustaf Neumann <ne...@wu...> wrote: >> Here are some actual figures with a comparable number of requests: >> >> with minthreads==maxthreads==2 >> requests 10182 queued 2695 connthreads 11 cpu 00:05:27 rss 415 >> >> below are the previous values, competed by the number of queuing operations >> and the rss size in MV >> >> with minthreads=2, create when queue >= 2 >> requests 10104 queued 1584 connthreads 27 cpu 00:06:14 rss 466 >> >> as anticipated, thread creations and cpu consumption went down, but the >> number of queued requests (requests that could not be executed immediately) >> increased significantly. > I was thinking of the opposite: make min/max threads equal by > increasing min threads. Requests would never stall in the queue, > unlike the experiment you ran with max threads reduced to min threads. On the site we have maxthreads 10, so setting minthreads to 10 as well has the consequence of a larger memsize (and a substantially reduced number of queued requests). > But there's another benefit: unlike the dynamic scenario requests > would also never stall in the queue when a new thread had to be > started when min < max threads. You are talking about NaviServer before 4.99.4. Both the version in the tip of the naviserver repository and the forked version already provide warmed-up threads. The version on the main tip starts to listen to the wakeup signals only once it is warmed up, and the version in the fork adds a thread to the conn queue only after its startup is complete. So, in both cases there is no stall. In earlier versions, it was as you describe. > What is the down side to increasing min threads up to max threads? Higher memory consumption, maybe more open database connections, allocating resources which are not needed. The degree of wastefulness certainly depends on maxthreads. I would assume that for an admin carefully watching the server's needs, setting minthreads==maxthreads to the "right value" can lead to slight improvements, as long as the load is rather constant over time. >> Maybe the most significant benefit of a low maxthreads value is the reduced >> memory consumption. On this machine we are using plain Tcl with its "zippy >> malloc", which does not release memory (once allocated to its pool) back to >> the OS. So, the measured memsize depends on the max number of threads with >> tcl interps, especially with large blueprints (as in the case of OpenACS). > Right: the max number of threads *ever*, not just currently. So by > killing threads you don't reduce memory usage, but you do increase > latency for some requests which have to wait for a thread+interp to be > created. Not really, thanks to the warm-up feature. > Is it convenient to measure latency distribution (not just average)? I > guess not: we record conn.startTime when a connection is taken out of > the queue and passed to a conn thread, but we don't record the time > when a socket was accepted. We could record the socket accept time and measure the difference up to the start of the connection runtime; if we output this to the access log (like logreqtime) we could run whatever statistics we want. > Actually, managing request latency is another area we don't handle so > well. You can influence it by adjusting the OS listen socket accept > queue length, you can adjust the length of the naviserver queue, and > with the proposed change here you can change how aggressive new > threads are created to process requests in the queue. 
But queue-depth > is a roundabout way of specifying milliseconds of latency. And not > just round-about but inherently imprecise as different URLs are going > to require different amounts of time to complete, and which URLs are > requested is a function of current traffic. If instead of queue size > you could specify a target latency then we could maybe do smarter > things with the queue, such as pull requests off the back of the queue > which have been waiting longer than the target latency, making room > for fresh requests on the front of the queue. The idea of controlling the number of running threads via queuing latency is interesting, but I have to look into the details before I can comment on this. -gustaf |
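As a sketch of "whatever statistics we want": once the accept-to-run differences are written to the access log, a few lines of Tcl are enough to look at the distribution rather than just the average. The extraction of the times from the log is left out here, and latencyStats is a made-up helper name:

  proc latencyStats {times} {
      # times: list of per-request waiting times in seconds
      set sorted [lsort -real $times]
      set n [llength $sorted]
      set sum 0.0
      foreach t $sorted { set sum [expr {$sum + $t}] }
      set result [list avg [expr {$sum / $n}]]
      # simple percentiles over the sorted list
      foreach p {0.5 0.9 0.99} {
          set idx [expr {int(ceil($p * $n)) - 1}]
          lappend result p$p [lindex $sorted $idx]
      }
      return $result
  }
  # e.g.: latencyStats {0.000110 0.000090 0.000088 0.244246 0.431545 0.480382}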
From: Gustaf N. <ne...@wu...> - 2012-11-20 19:07:40
|
Dear all, > The idea of controlling the number of running threads via queuing > latency is interesting, but i have to look into the details before i > can comment on this. Before one can consider controlling the number of running threads via queuing latency, one has to improve the awareness in NaviServer of the various phases in a request's lifetime. In the experimental version, the following time stamps are now recorded:

- acceptTime (the time a socket was accepted)
- requestQueueTime (the time the request was queued; previously startTime)
- requestDequeueTime (the time the request was dequeued)

The difference between requestQueueTime and acceptTime is the setup cost and depends on the amount of work the driver does. For instance, nssock of NaviServer performs read-ahead, while nsssl does not and passes the connection on right away. So the previously used startTime (which is actually the time the request was queued) was not correct for drivers with read-ahead. In the experimental version, [ns_conn start] now always returns the accept time. The next paragraph uses the term endTime, which is the time when a connection thread is done with a request (either the content was delivered, or the content was handed over to a writer thread).

The difference between requestDequeueTime and requestQueueTime is the time spent in the queue. The difference between endTime and requestDequeueTime is the pure runtime; the difference between endTime and acceptTime is the totalTime. As a rough approximation, the time between acceptTime and requestDequeueTime is mostly determined by the server setup, and the runtime by the application. I used the term "approximation" since the runtime of certain other requests influences the queuing time, as we see in the following:

Consider a server with two running connection threads receiving 6 requests, where requests 2-5 are received within a very short time. The first three requests are directly assigned to connection threads and have fromQueue == 0. These have queuing times between 88 and 110 microseconds, which include signal sending/receiving, the thread change, and the initial setup in the connection thread. The runtimes of the requests in this example are pretty long, in the range of 0.24 to 3.8 seconds elapsed time.

[1] waiting 0 current 2 idle 1 ncons 999 fromQueue 0 accept 0.000000 queue 0.000110 run 0.637781 total 0.637891
[2] waiting 3 current 2 idle 0 ncons 998 fromQueue 0 accept 0.000000 queue 0.000090 run 0.245030 total 0.245120
[3] waiting 2 current 2 idle 0 ncons 987 fromQueue 0 accept 0.000000 queue 0.000088 run 0.432421 total 0.432509
[4] waiting 1 current 2 idle 0 ncons 997 fromQueue 1 accept 0.000000 queue 0.244246 run 0.249208 total 0.493454
[5] waiting 0 current 2 idle 0 ncons 986 fromQueue 1 accept 0.000000 queue 0.431545 run 3.713331 total 4.144876
[6] waiting 0 current 2 idle 0 ncons 996 fromQueue 1 accept 0.000000 queue 0.480382 run 3.799818 total 4.280200

Requests [4, 5, 6] are queued and have queuing times between 0.2 and 0.5 seconds. These queuing times are pretty much the runtimes of [2, 3, 4]; the runtime of earlier requests therefore determines the queuing time of later ones. For example, the totalTime of request [4] was 0.493454 secs, half of which was spent waiting in the queue. Request [4] can consider itself lucky that it was not scheduled after [5] or [6], where its totalTime would likely have been in the range of 4 secs (10 times slower). Low waiting times are essential for good performance. 
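In Tcl-like notation, the durations used in the trace above are simply differences of the recorded time stamps (endTime being the per-request end time mentioned earlier; the variable names follow the text, not an existing API):

  set setupCost [expr {$requestQueueTime  - $acceptTime}]           ;# driver read-ahead etc.
  set queueTime [expr {$requestDequeueTime - $requestQueueTime}]    ;# "queue" column
  set runTime   [expr {$endTime - $requestDequeueTime}]             ;# "run" column
  set totalTime [expr {$endTime - $acceptTime}]                     ;# "total" column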
This example shows pretty well the importance of async delivery mechanisms like the writer thread or bgdelivery in OpenACS. A file being delivered by the connection thread over a slow internet connection might block later requests substantially (as in the cases above). This is even more important for today's web sites, where a single view might entail 60+ embedded requests for js, css, images, ..., and where it is not feasible to define hundreds of connection threads.

Before going into further detail, I'll provide additional introspection mechanisms in the experimental version:

- [ns_server stats] ... adding the total waiting time
- [ns_conn queuetime] ... time spent in the queue
- [ns_conn dequeue] ... time stamp when the request actually starts to run (similar to [ns_conn start])

The first can be used for server monitoring, the other two for single-connection introspection. The queuetime can be useful for better awareness and for optional output in the access log, and the dequeue time stamp for application-level profiling, as the base for a difference with the current time.

Further wishes, suggestions, comments? -gustaf neumann |
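Assuming the introspection lands under the names proposed above (in particular [ns_conn queuetime] returning the seconds spent in the queue), a minimal sketch of how a site could surface slow queueing via a trace filter; the threshold, proc name and method/URL pattern are arbitrary:

  proc log_slow_queueing {args} {
      # the filter reason ("trace") arrives in args; not needed here
      set qt [ns_conn queuetime]   ;# proposed subcommand, not yet part of the released API
      if {$qt > 0.1} {
          ns_log warning "request [ns_conn url] waited [format %.3f $qt]s in the queue"
      }
      return filter_ok
  }
  ns_register_filter trace GET /* log_slow_queueing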