From: Gustaf N. <ne...@wu...> - 2012-11-28 10:38:33
Dear all,

here is a short update of the findings and changes since last week.

One can now activate "logpartialtimes" for nslog, which adds an entry to the access log containing the partial request times (accept time, queuing time, and run time). The sample site now runs with minthreads == 5 and maxthreads == 10.

If one analyzes the data of one day of our sample site, one can see that in this case the accept time is always 0 (this is caused by nsssl, which is a sync driver; header parsing happens solely in the connection thread). The queue time is on average 54 ms (milliseconds), which is huge, while the median is 0.1 ms. The reason is that we can see a maximum queue time of 21 secs! As a consequence, the standard deviation is huge as well. When one looks more closely at the log file, one can see that at times when the queuing time is huge, the runtime is often huge as well, leading to cascading effects as described below. Cause and effect are hard to determine, since we saw large runtimes even on delivery of rather small responses.

            accept     queue       run     total
   avg      0.0000    0.0547    0.2656    0.3203
   stdev    0.0000    0.7481    0.9740    1.4681
   min      0.0000    0.0000    0.0134    0.0135
   max      0.0000   21.9594   16.6905   29.1668
   median   0.0000    0.0001    0.0329    0.0330

The causes for the "random" slow cases come from at least two sources: slow delivery (some clients are slow in retrieving data; especially some bots seem to be bad) and system-specific latencies (e.g. a backup running, other servers becoming active, etc.). To address this problem, I moved more file deliveries to the writer thread (by reducing writersize) and activated TCP_DEFER_ACCEPT. One sees significant improvements (the average queuing time is 4.6 ms), but the standard deviation is still huge, caused by some "erratic" large times.

            accept     queue       run     total
   avg      0.0000    0.0046    0.1538    0.1583
   stdev    0.0000    0.0934    0.6868    0.6967
   min      0.0000    0.0000    0.0137    0.0138
   max      0.0000    4.6041   22.4101   22.4102
   median   0.0000    0.0001    0.0216    0.0217

There are still some huge values, and the queuing time was "lucky" not to be influenced by the still large runtimes of requests. Aside from the top values, one sometimes sees unexplained delays of a few seconds, which seem to block everything.

A large source of latency is the file system. On our production system we already invested some effort in the past to tune the file system, so we don't see effects as large as on this sample site. Since NaviServer should work well also on non-tuned file systems, I implemented an async writer thread that decouples writing to files (access.log and error.log) from the connection threads. The interface is the same as unix write(), but it does not block.

I have had this running since yesterday evening on the sample site, and so far it seems to reduce the random latencies significantly. I'll wait with posting data until I have a similar time range to compare; maybe the good times I am seeing are accidental.

I have not yet looked into the details of handling the async writer within logrotate and shutdown.
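For readers who want to reproduce the setup described at the top of this message, here is a rough configuration sketch. The section paths and parameter names ("logpartialtimes", "minthreads", "maxthreads", "writersize", "deferaccept") follow the discussion in this thread, but their exact spelling and placement should be checked against the nslog and nssock documentation of the NaviServer version in use; treat this as an illustration, not a verified recipe.

   # Sketch only: enable partial request times in the access log and
   # pin the connection-thread pool to 5..10 threads.
   ns_section "ns/server/$server/module/nslog"
   ns_param   logpartialtimes  true     ;# add accept/queue/run times to each access-log entry

   ns_section "ns/server/$server"
   ns_param   minthreads  5
   ns_param   maxthreads  10

   # Driver-level knobs discussed above (names as in nssock; check your version):
   ns_section "ns/server/$server/module/nssock"
   ns_param   writersize   1000         ;# responses of at least this size go to the writer thread
   ns_param   deferaccept  true         ;# sets TCP_DEFER_ACCEPT on the listen socket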
On another front, I was experimenting with jemalloc and tcmalloc. Previously we had the following figures, using Tcl as distributed (with zippy malloc):

   with minthreads=2, maxthreads=2
   requests 10307 spools 1280 queued 2704 connthreads 11 rss 425

When changing minthreads to 5 and maxthreads to 10, but using jemalloc, one sees the following figures:

   requests  3933 spools 188 queued  67 connthreads  7 rss 488
   requests  6325 spools 256 queued  90 connthreads  9 rss 528
   requests 10021 spools 378 queued 114 connthreads 14 rss 530

It is interesting to see that with always 5 connection threads running and using jemalloc, we see an rss consumption only slightly larger than with plain Tcl and zippy malloc with maxthreads == 2, while having fewer requests queued.

Similarly, with tcmalloc and minthreads 5, maxthreads 10, we see

   requests 2062 spools  49 queued   3 connthreads  6 rss 376
   requests 7743 spools 429 queued 359 connthreads 11 rss 466
   requests 8389 spools 451 queued 366 connthreads 12 rss 466

which is even better.

For more information on malloc tests see https://next-scripting.org/xowiki/docs/misc/thread-mallocs or the tcl-core mailing list.

That's all for now
-gustaf neumann

Am 20.11.12 20:07, schrieb Gustaf Neumann:
> Dear all,
>
>> The idea of controlling the number of running threads via queuing
>> latency is interesting, but I have to look into the details before I
>> can comment on this.
> Before one can consider controlling the number of running threads via
> queuing latency, one has to improve the awareness in NaviServer of the
> various phases in a request's lifetime.
>
> In the experimental version, we now record the following time stamps:
>
> - acceptTime (the time a socket was accepted)
> - requestQueueTime (the time the request was queued; was startTime)
> - requestDequeueTime (the time the request was dequeued)
>
> The difference between requestQueueTime and acceptTime is the setup
> cost and depends on the amount of work the driver does. For instance,
> nssock of NaviServer performs read-ahead, while nsssl does not and
> passes the connection on right away. So the previously used startTime
> (which is actually the time the request was queued) was not correct for
> drivers with read-ahead. In the experimental version, [ns_conn start]
> now always returns the accept time.
>
> The next paragraph uses the term endTime, which is the time when a
> connection thread is done with a request (either the content was
> delivered, or the content was handed over to a writer thread).
>
> The difference between requestDequeueTime and requestQueueTime is the
> time spent in the queue. The difference between endTime and
> requestDequeueTime is the pure runtime, and the difference between
> endTime and acceptTime is the totalTime. As a rough approximation, the
> time between requestDequeueTime and acceptTime is mostly influenced by
> the server setup, and the runtime by the application. I used the term
> "approximation" since the runtime of certain other requests influences
> the queuing time, as we see in the following:
>
> Consider a server with two running connection threads receiving 6
> requests, where requests 2-5 are received in a very short time. The
> first three requests are directly assigned to connection threads and
> have fromQueue == 0. These have queuing times between 88 and 110
> microseconds, which includes signal sending/receiving, the thread
> change, and the initial setup in the connection thread. The runtimes
> for these requests are pretty bad, in the range of 0.24 to 3.8 seconds
> elapsed time.
>
> [1] waiting 0 current 2 idle 1 ncons 999 fromQueue 0 accept 0.000000 queue 0.000110 run 0.637781 total 0.637891
> [2] waiting 3 current 2 idle 0 ncons 998 fromQueue 0 accept 0.000000 queue 0.000090 run 0.245030 total 0.245120
> [3] waiting 2 current 2 idle 0 ncons 987 fromQueue 0 accept 0.000000 queue 0.000088 run 0.432421 total 0.432509
> [4] waiting 1 current 2 idle 0 ncons 997 fromQueue 1 accept 0.000000 queue 0.244246 run 0.249208 total 0.493454
> [5] waiting 0 current 2 idle 0 ncons 986 fromQueue 1 accept 0.000000 queue 0.431545 run 3.713331 total 4.144876
> [6] waiting 0 current 2 idle 0 ncons 996 fromQueue 1 accept 0.000000 queue 0.480382 run 3.799818 total 4.280200
>
> Requests [4, 5, 6] are queued and have queuing times between 0.2 and
> 0.5 seconds. The queuing times are pretty much the runtimes of [2, 3,
> 4]; therefore the runtime determines the queuing time. For example,
> the totalTime of request [4] was 0.493454 secs, and half of that time
> it was waiting in the queue. Request [4] can consider itself lucky that
> it was not scheduled after [5] or [6], where its totalTime would likely
> have been in the range of 4 secs (10 times slower). Low waiting times
> are essential for good performance.
>
> This example shows pretty well the importance of async delivery
> mechanisms like the writer thread or bgdelivery in OpenACS. A file
> being delivered by the connection thread over a slow internet
> connection might block later requests substantially (as in the cases
> above). This is even more important for today's websites, where a
> single view might entail 60+ embedded requests for js, css, images,
> etc., and where it is not feasible to define hundreds of connection
> threads.
>
> Before going further into detail I'll provide further introspection
> mechanisms in the experimental version:
>
> - [ns_server stats] adding the total waiting time
> - [ns_conn queuetime] ... time spent in the queue
> - [ns_conn dequeue] ... time stamp when the request actually starts to
>   run (similar to [ns_conn start])
>
> The first can be used for server monitoring, the next two for
> single-connection introspection. The queuetime can be useful for better
> awareness and for optional output in the access log, and the dequeue
> time stamp for application-level profiling, as the base for a
> difference with the current time.
>
> Further wishes, suggestions, comments?
>
> -gustaf neumann
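As a side note on the introspection commands listed just above, here is a small hedged sketch of how the single-connection introspection could be used from application code. It assumes the experimental [ns_conn queuetime] returns the queueing duration in seconds as a floating-point number (the exact return format is not spelled out in this thread); the URL pattern and the 0.1 s threshold are made up for the example.

   # Log a warning for requests that spent noticeably long in the queue.
   ns_register_filter postauth GET /* log_queuetime
   proc log_queuetime {why} {
       set qt [ns_conn queuetime]     ;# assumed: seconds as a float
       if {$qt > 0.1} {
           ns_log warning "queued ${qt}s before run: [ns_conn url]"
       }
       return filter_ok
   }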
From: Gustaf N. <ne...@wu...> - 2012-11-29 18:51:44
Dear all,

one more update. Here is now the data of a full day with the async-writer thread. The avg queueing time dropped from 0.0547 to 0.0009, and the standard deviation from 0.7481 to 0.0211. This is a noticeable improvement.

            accept     queue       run     total
   avg      0.0000    0.0009    0.1304    0.1313
   stdev    0.0000    0.0211    0.9022    0.9027
   min      0.0000    0.0000    0.0032    0.0033
   max      0.0000    0.9772   28.1699   28.1700
   median   0.0000    0.0001    0.0208    0.0209

But still, the occasionally large values are worth investigating. It turned out that the large queueing times came from requests from Taipeh, which contained several 404 errors. The size of the 404 response is 727 bytes, and therefore under the writersize, which was configured as 1000. The delivery of an error message to this site takes more than a second. Funnily enough, the delivery of the error message blocked the connection thread longer than the delivery of the image when it is above the writersize.

I will reduce the writersize further, but still, a slow delivery can even slow down the delivery of the headers, which still happens in the connection thread. Most likely I won't address this now; however, I will check the effects of the fastpath cache, which was deactivated so far...

best regards
-gustaf neumann
From: Maurizio M. <Mau...@sp...> - 2012-11-29 19:30:30
This is good engineering work. I raise my hat to that.

Congrats,
Maurizio
From: Stephen D. <sd...@gm...> - 2012-12-04 21:06:21
On Thu, Nov 29, 2012 at 6:51 PM, Gustaf Neumann <ne...@wu...> wrote:
>
> It turned out that the large queueing times came from requests from Taipeh,
> which contained several 404 errors. The size of the 404 response is 727
> bytes, and therefore under the writersize, which was configured as 1000.
> The delivery of an error message to this site takes more than a second.
> Funnily enough, the delivery of the error message blocked the connection
> thread longer than the delivery of the image when it is above the
> writersize.
>
> I will reduce the writersize further, but still, a slow delivery can even
> slow down the delivery of the headers, which still happens in the
> connection thread.

This shouldn't be the case for strings, or for data sent from the fast-path cache, such as a small file (a custom 404): eventually those should work their way down to Ns_ConnWriteData, which constructs the headers if they have not already been sent and passes them, along with the data payload, to writev(2). Linux should coalesce the buffers and send them in a single packet, if small enough. I wonder if this is some kind of weird nsssl interaction.

(For things like sendfile without SSL we could use TCP_CORK to coalesce the headers with the body.)
From: Jeff R. <dv...@di...> - 2012-11-29 19:25:17
Gustaf Neumann wrote:
> Funnily enough, the delivery of the error message blocked the connection
> thread longer than the delivery of the image when it is above the
> writersize.
>
> I will reduce the writersize further, but still, a slow delivery can even
> slow down the delivery of the headers, which still happens in the
> connection thread. Most likely I won't address this now; however, I
> will check the effects of the fastpath cache, which was deactivated so far...

Hi Gustaf,

These have been some really interesting bits you've been sharing. I keep meaning to add in my thoughts but I haven't had the time.

One quick idea on the writer thread is to always make one write attempt in the conn thread, regardless of size, and if less than the complete buffer was written, then pass the remainder off to the writer thread. This would get the best of both worlds: fast/small requests don't incur the overhead of moving and duplicating the buffers, while large/slow requests wouldn't block the conn thread.

-J
From: Gustaf N. <ne...@wu...> - 2012-11-30 13:51:38
Am 29.11.12 20:24, schrieb Jeff Rogers:
>
> Hi Gustaf,
>
> One quick idea on the writer thread is to always make one write attempt
> in the conn thread, regardless of size, and if less than the complete
> buffer was written, then pass the remainder off to the writer thread.
> This would get the best of both worlds: fast/small requests don't incur
> the overhead of moving and duplicating the buffers, while large/slow
> requests wouldn't block the conn thread.

Yes, that might be interesting to investigate as well. One has to differentiate between synchronous and asynchronous drivers, and between file deliveries and string deliveries, etc., so in the general case it might be some more work.

However, I am still in the process of cleaning up and addressing some strange interactions (e.g. for nsssl, some socket-closing interactions between the driver and connection threads seem too complex to me), so I am still busy with that for a while. I hope that I don't create too much collateral damage, and will let you know when the new stuff has stabilized...

-gustaf neumann

--
Univ.Prof. Dr. Gustaf Neumann
Institute of Information Systems and New Media
WU Vienna
Augasse 2-6, A-1090 Vienna, AUSTRIA
From: Gustaf N. <ne...@wu...> - 2012-12-03 10:39:20
Am 29.11.12 19:51, schrieb Gustaf Neumann:
> However, I am still in the process of cleaning up and addressing some
> strange interactions (e.g. for nsssl, some socket-closing interactions
> between the driver and connection threads seem too complex to me), so I
> am still busy with that for a while.

The problem behind the bad interaction was that the driver poll reported that there should be data to read from the socket, while the nsssl receive said there was none, and vice versa. Since nsssl was defined as sync, the socket processing happens in the connection threads via NsGetRequest(), which frequently returned empty, meaning a useless wakeup, ConnRun, etc. in the connection thread. The reason for this was that nsssl sets the socket to non-blocking without defining the driver as ASYNC, so it was possible that a receive returned 0 bytes. Simply setting the driver to ASYNC did not help either, since the driver used its own event loop via Ns_SockTimedWait(), which interacts badly with the driver's poll loop (on Linux it did not realize that a socket had become readable, while this worked on Mac OS X). So I changed the receive operation to simply report the read state back to the main driver, which now apparently works OK (tested on Mac OS X and Linux). To be on the safe side, I have tagged the previous version of nsssl on bitbucket and incremented the version number. This change also led to a simplification in the main driver.

The new version has been running since Saturday on next-scripting.org and shows even better results. The average queue time is lower by again a factor of 4, and the standard deviation is much better as well.

            accept     queue       run     total
   avg      0.0599    0.0002    0.0183    0.0196
   stdev    0.3382    0.0039    0.0413    0.0530
   min      0.0000    0.0001    0.0011    0.0012
   max     12.1002    0.2324    0.8958    2.0112
   median   0.0088    0.0001    0.0118    0.0120

The accept times now show up since nsssl is async, but these are often not meaningful for performance (due to closewait, the accept might stem from a previous accept operation). Now all performance measurement happens relative to the queue time (similar to NaviServer in the main tip). Note that this data is based not on a synthetic test but on a real-world web site with potentially different traffic patterns from day to day, so don't draw too many conclusions from it.

By moving from sync to async, some potentially time-consuming processing (waiting for more data) is moved from the connection thread to the driver's event loop, so the time span in the connection thread is reduced (for nsssl in the same way as it was before for nssock). Therefore the "runtime" has less randomness and is shorter, leading to the improvements.

The behavior of the async log writer is now configurable (by default it is turned off); log rolling and shutdown should now work reliably with it as well. All changes are on bitbucket (nsssl and naviserver-connthreadqueue). The server now also measures the filter time separately from the runtime, but I am just in the process of collecting data on that...

-gustaf neumann
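As a small illustration of how the aggregate numbers above could be collected continuously, here is a hedged monitoring sketch. It assumes that [ns_server stats] (mentioned earlier in this thread) returns its counters as a flat list that is reasonable to dump into the error log; the 5-minute interval is arbitrary.

   # Sketch: dump connection-thread statistics to the error log every 5 minutes.
   ns_schedule_proc 300 {
       ns_log notice "connthread stats: [ns_server stats]"
   }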
From: Gustaf N. <ne...@wu...> - 2012-12-04 17:21:32
Am 29.11.12 19:51, schrieb Gustaf Neumann:
> I will reduce the writersize further, but still, a slow delivery can
> even slow down the delivery of the headers, which still happens in the
> connection thread. Most likely I won't address this now; however, I
> will check the effects of the fastpath cache, which was deactivated so far...

It turned out that the interactions between the various delivery methods of NaviServer are more differentiated than I expected. As one can see from the call graph below, the content-returning infrastructure of NaviServer is very differentiated:

* The boxed nodes are the Tcl commands; the pink nodes denote handling by the writer thread.
* Of course, the ns_writer command can be used on the application layer and always tries delivery via the writer thread (a usage sketch follows at the end of this message).
* The implicit calls (e.g. ns_returnfile) go through FastReturn(), which calls Ns_ConnReturnOpenFd() only when neither caching nor mmap is activated. As soon as one of these is activated, ns_return etc. will never use the writer thread.
* All of the error messages and redirects use the path via Ns_ConnWriteChars.
* Only content sent via Ns_ConnWriteVChars has a chance to get compressed.
* Similarly, range requests are not handled when the data is not sent via ReturnOpen to the writer queue.
* The nodes with the gray background are marked as deprecated.
* The nodes with the gray font are not called at all.
* The nodes with the blue font are in the driver and are not further analyzed.

To make it short:

* Caching and mmap deactivate the implicit delivery via the writer.
* The non-200 return messages are always delivered via the connection thread; reducing the writersize does not make any difference.
* There is quite some potential to simplify and orthogonalize the server's delivery infrastructure.
* Improving this structure has nothing to do with naviserver-connthreadqueue, and should happen at some point in the main tip.

-gustaf neumann
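To make the ns_writer point above concrete, here is a hedged sketch of handing a large file to the writer thread explicitly from application code. The subcommand name "submitfile", its boolean return value, and the file path are assumptions based on later NaviServer documentation, not on this thread; check [ns_writer] in your build before relying on it.

   # Sketch: try to deliver a big download via the writer thread; if the
   # writer declines (e.g. not configured), fall back to a normal return.
   set path "/var/www/files/big-download.iso"      ;# hypothetical file
   if {![ns_writer submitfile $path]} {
       ns_returnfile 200 [ns_guesstype $path] $path
   }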
From: Stephen D. <sd...@gm...> - 2012-12-04 19:06:52
On Tue, Dec 4, 2012 at 5:21 PM, Gustaf Neumann <ne...@wu...> wrote:
>
> * Only content sent via Ns_ConnWriteVChars has a chance to get
>   compressed.

i.e. dynamic content with a text/* mime-type. The idea here was that you don't want to try and compress gifs and so on, and static content could be pre-compressed on disk: at runtime, simply look for a *.gz version of the content. This could be cleaned up a bit by:

- having an extendable white-list of mime-types which should be compressed: text/*, application/javascript, application/xml etc.
- actually shipping some code which searches for *.gz versions of static files (a sketch of this idea follows below)

> * Similarly, range requests are not handled when the data is not sent
>   via ReturnOpen to the writer queue.

The diagram shows Ns_ConnReturnData also calls ReturnRange, and hence the other leg of fastpath and all the main data-sending routines should handle range requests.

> * There is quite some potential to simplify and orthogonalize the
>   server's delivery infrastructure.
> * Improving this structure has nothing to do with
>   naviserver-connthreadqueue, and should happen at some point in the main tip.

The writer thread was one of the last bits of code to land before things quietened down, and a lot of the stuff that got talked about didn't get implemented. One thing that was mentioned was having a callback interface where you submit a function to the writer thread and it runs it. This would allow other kinds of requests to be served async.

One of the things we've been talking about with the connthread work is simplification. The current code, with its workarounds for stalls and managing thread counts, is very complicated. If it were simplified and genericised, it could also be used for background writer threads and SSL read-ahead threads (as in AOLserver > 4.5). So that's another +1 for keeping the conn threads simple.
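A hedged sketch of that *.gz idea, for illustration only: the URL pattern is made up, the filter assumes the static files live under the page root (so ns_url2file resolves them), and real code would also want to send a Vary header and handle HEAD requests.

   # If the client accepts gzip and a pre-compressed sibling exists on disk,
   # serve the .gz file with the original mime-type and Content-Encoding: gzip.
   ns_register_filter preauth GET /static/* serve_pregzipped
   proc serve_pregzipped {why} {
       set file [ns_url2file [ns_conn url]]
       set gz   $file.gz
       set enc  [ns_set iget [ns_conn headers] accept-encoding]
       if {[string match *gzip* $enc] && [file readable $gz]} {
           ns_set put [ns_conn outputheaders] Content-Encoding gzip
           ns_returnfile 200 [ns_guesstype $file] $gz
           return filter_return
       }
       return filter_ok
   }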
From: Stephen D. <sd...@gm...> - 2012-12-04 19:25:54
On Mon, Dec 3, 2012 at 10:38 AM, Gustaf Neumann <ne...@wu...> wrote:
>
> All changes are on bitbucket (nsssl and naviserver-connthreadqueue).

I found this nifty site the other day:

https://www.ssllabs.com/ssltest/analyze.html?d=next-scripting.org

It highlights a few things that need to be fixed in the nsssl module, including a couple of security bugs. Looks like relatively little code though.

Also, there's this:

https://insouciant.org/tech/ssl-performance-case-study/

which is a pretty good explanation of things from a performance point of view.

I haven't spent much time looking at SSL. It looks like there could be some big wins. For example, some of the stuff to do with certificate chains could probably be automated; the server could spit out an informative error to the log if things look poorly optimised.
From: Gustaf N. <ne...@wu...> - 2012-12-04 22:24:55
Am 04.12.12 20:25, schrieb Stephen Deasey:
> I found this nifty site the other day:
>
> https://www.ssllabs.com/ssltest/analyze.html?d=next-scripting.org
>
> It highlights a few things that need to be fixed in the nsssl module,
> including a couple of security bugs. Looks like relatively little code
> though.

The report is already much better: now everything is green. Most of the complaints could be removed via configuration; just two issues required code changes (one requires a flag which is not available in all current OpenSSL implementations, such as the one from Mac OS X; the other required adding a callback). The security rating is now better than that of nginx.

Today I was hunting another problem in connection with nsssl, which turns out to be a weakness of our interfaces. The source of the problem is that the buffer management of OpenSSL is not aligned with the buffer management in NaviServer. In the NaviServer driver, all receive requests are triggered via poll(), when sockets are readable. With OpenSSL it is also possible that data is left over from an earlier receive when a smaller buffer is provided. NaviServer requested receive operations with a 4KB buffer during upload spooling; OpenSSL might receive 16KB "at once". The read operation with the small buffer will not drain the OpenSSL buffer, and later poll() will not be triggered by the socket being readable (since the OpenSSL buffer is still quite full). The problem showed up in NaviServer when the input was spooled (e.g. file uploads). I have doubts that this combination ever worked. I have corrected the problem by increasing the buffer size in driver.c. The cleaner implementation would be to add an "Ns_DriverReadableProc Readable" similar to the "Ns_DriverKeepProc Keep", but that would affect the interface of all drivers.

-gustaf neumann
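For readers wondering what "removed via configuration" might look like, here is a purely illustrative sketch of an nsssl module section. The parameter names ("certificate", "ciphers", "protocols") match later nsssl releases and are assumptions here, as are the path and the cipher string; consult the nsssl README of the version you actually run.

   # Hypothetical hardening sketch; names and values are illustrative only.
   ns_section "ns/server/$server/module/nsssl"
   ns_param certificate "/usr/local/ns/modules/nsssl/server.pem"  ;# cert + key, hypothetical path
   ns_param ciphers     "HIGH:!aNULL:!eNULL:!SSLv2:!MD5:!RC4"     ;# example cipher string
   ns_param protocols   "!SSLv2"                                  ;# disable SSLv2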
From: Stephen D. <sd...@gm...> - 2012-12-06 12:04:33
On Tue, Dec 4, 2012 at 10:24 PM, Gustaf Neumann <ne...@wu...> wrote:
>
> Today I was hunting another problem in connection with nsssl, which
> turns out to be a weakness of our interfaces. The source of the problem
> is that the buffer management of OpenSSL is not aligned with the buffer
> management in NaviServer. In the NaviServer driver, all receive requests
> are triggered via poll(), when sockets are readable. With OpenSSL it is
> also possible that data is left over from an earlier receive when a
> smaller buffer is provided. NaviServer requested receive operations with
> a 4KB buffer during upload spooling; OpenSSL might receive 16KB "at
> once". The read operation with the small buffer will not drain the
> OpenSSL buffer, and later poll() will not be triggered by the socket
> being readable (since the OpenSSL buffer is still quite full). The
> problem showed up in NaviServer when the input was spooled (e.g. file
> uploads). I have doubts that this combination ever worked. I have
> corrected the problem by increasing the buffer size in driver.c. The
> cleaner implementation would be to add an "Ns_DriverReadableProc
> Readable" similar to the "Ns_DriverKeepProc Keep", but that would
> affect the interface of all drivers.

Another way to use the OpenSSL library is to manage the socket reads/writes yourself and hand memory buffers to OpenSSL to encrypt/decrypt.
From: Gustaf N. <ne...@wu...> - 2012-12-04 22:55:37
Am 04.12.12 20:06, schrieb Stephen Deasey:
> - we should actually ship some code which searches for *.gz versions
>   of static files

This would mean keeping a .gz version and a non-.gz version in the file system for the cases where gzip is not an accepted encoding. I am not sure I would like to manage these files and keep them in sync... The fast-path cache could keep gzipped copies; invalidation is already there.

> > * Similarly, range requests are not handled when the data is not sent
> >   via ReturnOpen to the writer queue.
>
> The diagram shows Ns_ConnReturnData also calls ReturnRange, and hence
> the other leg of fastpath and all the main data-sending routines
> should handle range requests.

That path is OK. When neither mmap nor the cache is set, fastpath can call ReturnOpenFd, and ReturnOpen sends the data blindly to the writer if configured, which does not handle ranges. This needs some refactoring.

> > * There is quite some potential to simplify and orthogonalize the
> >   server's delivery infrastructure.
> > * Improving this structure has nothing to do with
> >   naviserver-connthreadqueue, and should happen at some point in the
> >   main tip.
>
> The writer thread was one of the last bits of code to land before
> things quietened down, and a lot of the stuff that got talked about
> didn't get implemented.

I am not complaining, just trying to understand the historical layers. Without the call graph the current code is hard to follow.

> One thing that was mentioned was having a callback interface where
> you submit a function to the writer thread and it runs it. This would
> allow other kinds of requests to be served async.
>
> One of the things we've been talking about with the connthread work is
> simplification. The current code, with its workarounds for stalls and
> managing thread counts, is very complicated. If it were simplified and
> genericised, it could also be used for background writer threads and
> SSL read-ahead threads (as in AOLserver > 4.5). So that's another +1
> for keeping the conn threads simple.

The code in naviserver-connthreadqueue already handles read-aheads with SSL. I have already removed these hacks there; I think they were in part responsible for the sometimes erratic response times with SSL.

-gustaf neumann
From: Stephen D. <sd...@gm...> - 2012-12-04 23:29:31
On Tue, Dec 4, 2012 at 10:55 PM, Gustaf Neumann <ne...@wu...> wrote:
>
> The code in naviserver-connthreadqueue already handles read-aheads with SSL.
> I have already removed these hacks there; I think they were in part
> responsible for the sometimes erratic response times with SSL.

Well, I think the thing here is that once upon a time SSL was considered computationally expensive (I don't know if it still is, with recent Intel CPUs having dedicated AES instructions etc.). Read-ahead is good because you don't want an expensive conn thread waiting around for the whole request to arrive, packet by packet. But with SSL, the single driver thread will be decrypting read-ahead data for multiple sockets and may run out of CPU, stalling the request pipeline and starving the conn threads. By making the SSL driver thread non-async you lose out on read-ahead, as that all happens on the conn thread, but you gain CPU resources on a multi-CPU system (all of them, today). AOLserver 4.5 added a pool of read-ahead threads, one per socket IIRC, to keep the benefits of read-ahead while gaining CPU parallelism.

- Does a single driver thread have enough computational resources to decrypt all sockets currently in read-ahead? This is going to depend on the algorithm. Might want to favour AES if you know your CPU has support.
- Which is worse, losing read-ahead or losing CPU parallelism?
- If a read-ahead thread pool is added, should it be one thread per socket, which is simple, or one thread per CPU and some kind of balancing mechanism?
From: Stephen D. <sd...@gm...> - 2012-12-04 23:42:17
On Wed, Nov 28, 2012 at 10:38 AM, Gustaf Neumann <ne...@wu...> wrote:
>
> It is interesting to see that with always 5 connection threads running and
> using jemalloc, we see an rss consumption only slightly larger than with
> plain Tcl and zippy malloc with maxthreads == 2, while having fewer
> requests queued.
>
> Similarly, with tcmalloc and minthreads 5, maxthreads 10, we see
>
>    requests 2062 spools  49 queued   3 connthreads  6 rss 376
>    requests 7743 spools 429 queued 359 connthreads 11 rss 466
>    requests 8389 spools 451 queued 366 connthreads 12 rss 466
>
> which is even better.

Min/max threads 5/10 better than 2/10? How about 7/10? When you hit 10/10 you can delete an awful lot of code :-)
From: Gustaf N. <ne...@wu...> - 2012-12-05 01:12:04
Am 05.12.12 00:41, schrieb Stephen Deasey:
> On Wed, Nov 28, 2012 at 10:38 AM, Gustaf Neumann <ne...@wu...> wrote:
>> It is interesting to see that with always 5 connection threads running and
>> using jemalloc, we see an rss consumption only slightly larger than with
>> plain Tcl and zippy malloc with maxthreads == 2, while having fewer
>> requests queued.
>>
>> Similarly, with tcmalloc and minthreads 5, maxthreads 10, we see
>>
>>    requests 2062 spools  49 queued   3 connthreads  6 rss 376
>>    requests 7743 spools 429 queued 359 connthreads 11 rss 466
>>    requests 8389 spools 451 queued 366 connthreads 12 rss 466
>>
>> which is even better.
> Min/max threads 5/10 better than 2/10?

The numbers show that 5/10 with tcmalloc is better than 5/10 with jemalloc, and only slightly worse than 2/2 with zippy malloc.

-gn
From: Stephen D. <sd...@gm...> - 2012-12-06 12:14:17
On Tue, Dec 4, 2012 at 10:55 PM, Gustaf Neumann <ne...@wu...> wrote:
> Am 04.12.12 20:06, schrieb Stephen Deasey:
> > - we should actually ship some code which searches for *.gz versions of
> >   static files
>
> This would mean keeping a .gz version and a non-.gz version in the file
> system for the cases where gzip is not an accepted encoding. I am not sure
> I would like to manage these files and keep them in sync... The fast-path
> cache could keep gzipped copies; invalidation is already there.

I guess it depends on how the website is deployed: in a more modern set-up, CSS is often compiled from SASS or LESS; JavaScript needs to be minified and combined, possibly compiled using Google's optimising compiler, maybe from CoffeeScript; images are compressed, etc. Making gzip versions of static text/* files is just one more target in a Makefile. Which is a little different from the old PHP/OpenACS perspective where everything happens at run time.
From: Gustaf N. <ne...@wu...> - 2012-12-06 13:15:52
On 06.12.12 13:13, Stephen Deasey wrote:
> I guess it depends on how the website is deployed: in a more modern
> set-up, CSS is often compiled from SASS or LESS; JavaScript needs to be
> minified and combined, possibly compiled using Google's optimising
> compiler, maybe from CoffeeScript; images are compressed, etc. Making
> gzip versions of static text/* files is just one more target in a
> Makefile. Which is a little different from the old PHP/OpenACS
> perspective where everything happens at run time.

Modern PHP/OpenACS installations use reverse proxies like nginx for static content, where one has the option to compress files on the fly or to deliver pre-compressed files. When we switched our production site to gzip delivery for the dynamic content, we did not notice any difference in CPU load. Sure, delivering statically gzipped content is faster than zipping on the fly, but I would like to keep the burden on the site master low.

I am not sure why we are discussing this now. My original argument was that the API structure for deliveries is overly complicated (to put it mildly) and not orthogonal (I failed to understand it without drawing the call graph). There is a lot of room for improvement.

-gustaf neumann