From: Gustaf N. <ne...@wu...> - 2012-12-04 17:21:32
Am 29.11.12 19:51, schrieb Gustaf Neumann:
> I will reduce the writersize further, but still a slow delivery can
> even slow down the delivery of the headers, which still happens in the
> connection thread. Most likely, i won't address this now. however, i
> will check the effects of the fastpath cache, which was deactivated so far...

It turned out that the interactions between the various delivery methods of NaviServer are more differentiated than I expected. As one can see from the call graph below, the content-returning infrastructure of NaviServer is quite differentiated.

* The boxed nodes are the Tcl commands; the pink nodes denote handling by the writer thread.
* The ns_writer command can, of course, be used at the application layer and always attempts delivery via the writer thread.
* The implicit calls (e.g. ns_returnfile) go through FastReturn(), which calls Ns_ConnReturnOpenFd() only when neither caching nor mmap is activated. As soon as one of these is activated, ns_return etc. will never use the writer thread.
* All error messages and redirects take the path via Ns_ConnWriteChars().
* Only content sent via Ns_ConnWriteVChars() has a chance to get compressed.
* Similarly, range requests are handled only when the data is sent via ReturnOpen to the writer queue.
* The nodes with a gray background are marked as deprecated.
* The nodes with a gray font are not called at all.
* The nodes with a blue font are in the driver and were not analyzed further.

To make it short:

* Caching and mmap deactivate the implicit delivery via the writer.
* The non-200 return messages are always delivered via the connection thread; reducing the writersize does not make any difference there.
* There is quite some potential to simplify and orthogonalize the server's infrastructure.
* Improving this structure has nothing to do with naviserver-connthreadqueue and should happen at some point in the main tip.

-gustaf neumann
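To make the application-layer path mentioned above concrete: a handler that wants writer-thread delivery even when caching or mmap would otherwise bypass it can submit the file explicitly. A minimal sketch, assuming a version providing the ns_writer submitfile subcommand; the file path and the fallback delivery are illustrative:

    # Try to hand the file to a writer thread explicitly; the path is
    # illustrative.  ns_writer submitfile is assumed to return 1 when
    # delivery was taken over by a writer thread, and 0 when the
    # caller still has to deliver the content itself.
    set file /var/www/data/large-download.bin
    if {[ns_writer submitfile $file]} {
        # delivered asynchronously; the connection thread is free again
        return
    }
    # fallback: ordinary delivery via the connection thread
    ns_returnfile 200 application/octet-stream $file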
From: Gustaf N. <ne...@wu...> - 2012-12-03 10:39:20
Am 29.11.12 19:51, schrieb Gustaf Neumann:
> However, i am still in the process to clean up and address some
> strange interactions (e.g. for nsssl some socket closing interactions
> between driver and connection threads seem too complex to me), so i am
> still for a while busy with that

The problem in the bad interaction was that the driver poll reported that there should be data to read from the socket, while the nsssl receive said there is none, and vice versa. Since nsssl was defined as sync, the socket processing happens in the connection threads via NsGetRequest(), which frequently returned empty, meaning a useless wakeup and ConnRun etc. in the connection thread. The reason for this was that nsssl sets the socket to non-blocking without defining the driver as ASYNC, so it was possible that a receive returned 0 bytes. Simply setting the driver to ASYNC did not help either, since the driver used its own event loop via Ns_SockTimedWait(), which interacts badly with the driver's poll loop (on Linux it did not realize that a socket became readable, while this worked on Mac OS X). So I changed the receive operation to simply report the read state back to the main driver, which now apparently works fine (tested on Mac OS X and Linux). To be on the safe side, I have tagged the previous version of nsssl on Bitbucket and incremented the version number. This change led to a simplification in the main driver as well.

The new version has been running since Saturday on next-scripting.org and shows even better results. The average queue time is again lower by a factor of 4, and the standard deviation is much better as well.

            accept    queue      run    total
  avg       0.0599   0.0002   0.0183   0.0196
  stdev     0.3382   0.0039   0.0413   0.0530
  min       0.0000   0.0001   0.0011   0.0012
  max      12.1002   0.2324   0.8958   2.0112
  median    0.0088   0.0001   0.0118   0.0120

The accept times now show up since nsssl is async, but they are often not meaningful for performance (due to closewait, the accept might belong to a previous accept operation). All performance measurement now happens relative to queue time (similar to NaviServer in the main tip). Note that this data is based not on a synthetic test but on a real-world web site with potentially different traffic patterns from day to day, so don't draw too many conclusions from it.

By moving from sync to async, some potentially time-consuming processing (waiting for more data) moves from the connection thread to the driver's event loop, so the time span in the connection thread is reduced (for nsssl the same way as before for nssock). Therefore the "runtime" has less randomness and is shorter, leading to the improvements.

The behavior of the async log writer is now configurable (turned off by default); log-rolling and shutdown should now work reliably with it as well. All changes are on Bitbucket (nsssl and naviserver-connthreadqueue). The server now also measures the filter time separately from the runtime, but I am still collecting data on that...

-gustaf neumann
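For readers following along with the driver settings discussed in this thread: the writer parameters are configured per driver section. A sketch of such a section; the parameter names follow the standard NaviServer driver parameters of this era, while the values and the certificate path are examples, not recommendations:

    # Illustrative nsssl driver section; ${server} is the server name
    # defined earlier in the config file.  Values are examples only.
    ns_section "ns/server/${server}/module/nsssl"
    ns_param port           443
    ns_param certificate    /etc/ssl/server.pem   ;# hypothetical path
    ns_param writerthreads  1      ;# number of writer threads for this driver
    ns_param writersize     1000   ;# deliveries above this size go to a writer thread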
From: Gustaf N. <ne...@wu...> - 2012-11-30 13:51:38
Am 29.11.12 20:24, schrieb Jeff Rogers:
> Hi Gustaf,
>
> One quick idea on the writer thread is, regardless of size, to always
> make one write attempt in the conn thread, and if less than the
> complete buffer was written, then pass the remainder off to the writer
> thread. This would get the best of both worlds - fast/small requests
> don't incur the overhead of duplicating the buffers, while large/slow
> requests wouldn't block the conn thread.

Yes, that might be interesting to investigate as well. One has to differentiate between synchronous and asynchronous drivers and between file deliveries and string deliveries, etc., so in the general case it might be some more work.

However, I am still in the process of cleaning up and addressing some strange interactions (e.g. for nsssl, some socket-closing interactions between driver and connection threads seem too complex to me), so I am still busy with that for a while. I hope that I don't create too much collateral damage, and will let you know when the new stuff has stabilized...

-gustaf neumann

--
Univ.Prof. Dr. Gustaf Neumann
Institute of Information Systems and New Media
WU Vienna
Augasse 2-6, A-1090 Vienna, AUSTRIA
From: Maurizio M. <Mau...@sp...> - 2012-11-29 19:30:30
This is good engineering work. I raise my hat to that.

Congrats,
Maurizio

From: Gustaf Neumann [mailto:ne...@wu...]
Sent: 29 November 2012 19:51
To: nav...@li...
Subject: Re: [naviserver-devel] naviserver with connection thread queue
From: Jeff R. <dv...@di...> - 2012-11-29 19:25:17
Gustaf Neumann wrote:
> Funny enough, the delivery of the error message blocked the connection
> thread longer than the delivery of the image when it is above the
> writersize.
>
> I will reduce the writersize further, but still a slow delivery can even
> slow down the delivery of the headers, which happens still in the
> connection thread. Most likely, i won't address this now. however, i
> will check the effects of the fastpath cache, which was deactivated so far...

Hi Gustaf,

This has been some really interesting bits you've been sharing. I keep meaning to add in my thoughts but I haven't had the time.

One quick idea on the writer thread is, regardless of size, to always make one write attempt in the conn thread, and if less than the complete buffer was written, pass the remainder off to the writer thread. This would get the best of both worlds - fast/small requests don't incur the overhead of duplicating the buffers, while large/slow requests wouldn't block the conn thread.

-J
From: Gustaf N. <ne...@wu...> - 2012-11-29 18:51:44
Dear all,

one more update. Here is now the data of a full day with the async-writer thread. The avg queueing time dropped from 0.0547 to 0.0009, the standard deviation from 0.7481 to 0.0211. This is a noticeable improvement.

            accept    queue      run    total
  avg       0.0000   0.0009   0.1304   0.1313
  stdev     0.0000   0.0211   0.9022   0.9027
  min       0.0000   0.0000   0.0032   0.0033
  max       0.0000   0.9772  28.1699  28.1700
  median    0.0000   0.0001   0.0208   0.0209

But still, the sometimes large values are worth investigating. It turned out that the large queueing times came from requests from Taipei, which contained several 404 errors. The size of the 404 response is 727 bytes and therefore under the writersize, which was configured as 1000. The delivery of an error message to this site takes more than a second. Funny enough, the delivery of the error message blocked the connection thread longer than the delivery of the image when it is above the writersize.

I will reduce the writersize further, but still a slow delivery can even slow down the delivery of the headers, which still happens in the connection thread. Most likely, I won't address this now; however, I will check the effects of the fastpath cache, which was deactivated so far...

best regards
-gustaf neumann
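For experimenting with the threshold, the config file is not the only option; some versions expose the writer threshold at runtime. A sketch, assuming the ns_writer size subcommand is available in the running version (it may not be in all releases), with an illustrative value:

    # Query and lower the writer threshold at runtime, so that small
    # responses such as the 727-byte 404s above are also handed to the
    # writer thread.  Assumes "ns_writer size" exists in this version.
    set old [ns_writer size]
    ns_log notice "writersize was $old"
    ns_writer size 500   ;# illustrative value below the 727-byte 404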
From: Gustaf N. <ne...@wu...> - 2012-11-28 10:38:33
Dear all,

here is a short update of the findings and changes since last week.

One can now activate for nslog "logpartialtimes", which adds an entry to the access log containing the partial request times (accept time, queuing time, and run time). The sample site now runs with minthreads == 5, maxthreads == 10.

If one analyzes the data of one day of our sample site, one can see that in this case the accept time is always 0 (this is caused by nsssl, which is a sync driver; header parsing happens solely in the connection thread). The queue time is on average 54 ms (milliseconds), which is huge, while the median is 0.1 ms. The reason is that we can see a maximum queue time of 21 secs! As a consequence, the standard deviation is huge as well. When one looks more closely at the log file, one can see that at times when the queuing time is huge, the runtime is often huge as well, leading to cascading effects as described below. Cause and reason are hard to determine, since we saw large runtimes even on delivery of rather small files.

            accept    queue      run    total
  avg       0.0000   0.0547   0.2656   0.3203
  stdev     0.0000   0.7481   0.9740   1.4681
  min       0.0000   0.0000   0.0134   0.0135
  max       0.0000  21.9594  16.6905  29.1668
  median    0.0000   0.0001   0.0329   0.0330

The causes for the "random" slow cases come from at least two sources: slow delivery (some clients are slow in retrieving data; especially some bots seem to be bad) and system-specific latencies (e.g. backup running, other servers becoming active, etc.). To address this problem, I moved more file deliveries to the writer thread (reducing writersize) and activated TCP_DEFER_ACCEPT. One sees significant improvements, the average queuing time is 4.6 ms, but the standard deviation is still huge, caused by some "erratic" large times.

            accept    queue      run    total
  avg       0.0000   0.0046   0.1538   0.1583
  stdev     0.0000   0.0934   0.6868   0.6967
  min       0.0000   0.0000   0.0137   0.0138
  max       0.0000   4.6041  22.4101  22.4102
  median    0.0000   0.0001   0.0216   0.0217

There are still some huge values, and the queuing time was "lucky" not to be influenced by the still large runtimes of requests. Aside from the top values, one sometimes sees unexplained delays of a few seconds, which seem to block everything.

A large source of latencies is the file system. On our production system we already invested some effort in tuning the file system in the past, so we don't see effects as large as on this sample site. Since NaviServer should work well also on non-tuned file systems, I implemented an async writer thread that decouples writing to files (access.log and error.log) from the connection threads. The interface is the same as unix write(), but it does not block.

I have had this running since yesterday evening on the sample site, and so far it seems to help reduce the random latencies significantly. I'll wait with posting data until I have a similar time range to compare; maybe the good times I am seeing are accidental.

I have not yet looked into the details of handling the async writer within logrotate and shutdown.

On another front, I was experimenting with jemalloc and tcmalloc. Previously we had the following figures, using Tcl as distributed with zippy malloc:

  with minthreads=2, maxthreads=2
  requests 10307 spools 1280 queued 2704 connthreads 11 rss 425

When changing minthreads to 5, maxthreads to 10, but using jemalloc, one sees the following figures:

  requests  3933 spools 188 queued  67 connthreads  7 rss 488
  requests  6325 spools 256 queued  90 connthreads  9 rss 528
  requests 10021 spools 378 queued 114 connthreads 14 rss 530

It is interesting to see that with always 5 connection threads running and using jemalloc, we see an rss consumption only slightly larger than with plain Tcl and zippy malloc having maxthreads == 2, while having fewer requests queued.

Similarly, with tcmalloc we see with minthreads 5, maxthreads 10:

  requests 2062 spools  49 queued   3 connthreads  6 rss 376
  requests 7743 spools 429 queued 359 connthreads 11 rss 466
  requests 8389 spools 451 queued 366 connthreads 12 rss 466

which is even better.

For more information on malloc tests see https://next-scripting.org/xowiki/docs/misc/thread-mallocs or the tcl-core mailing list.

That's all for now

-gustaf neumann
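To reproduce the avg/stdev/min/max/median rows above from one's own logpartialtimes data, a small self-contained helper is enough; the input is assumed to be a Tcl list of times in seconds, already extracted from the access log (the log parsing itself is omitted):

    # Compute the summary statistics used in the tables above from a
    # list of request times in seconds; assumes at least two values.
    proc stats {values} {
        set n [llength $values]
        set sum 0.0
        foreach v $values { set sum [expr {$sum + $v}] }
        set mean [expr {$sum / $n}]
        set ss 0.0
        foreach v $values { set ss [expr {$ss + ($v - $mean)**2}] }
        set sorted [lsort -real $values]
        return [list avg $mean \
                     stdev [expr {sqrt($ss / ($n - 1))}] \
                     min [lindex $sorted 0] \
                     max [lindex $sorted end] \
                     median [lindex $sorted [expr {$n / 2}]]]
    }
    # example usage with a few total times
    puts [stats {0.0135 0.0330 0.3203 29.1668}]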
From: Gustaf N. <ne...@wu...> - 2012-11-20 19:07:40
Dear all,

> The idea of controlling the number of running threads via queuing
> latency is interesting, but i have to look into the details before i
> can comment on this.

Before one can consider controlling the number of running threads via queuing latency, one has to improve the awareness in NaviServer about the various phases in a request's lifetime.

In the experimental version, we now have the following time stamps recorded:

- acceptTime (the time a socket was accepted)
- requestQueueTime (the time the request was queued; was startTime)
- requestDequeueTime (the time the request was dequeued)

The difference between requestQueueTime and acceptTime is the setup cost and depends on the amount of work the driver does. For instance, nssock of NaviServer performs read-ahead, while nsssl does not and passes the connection on right away. So the previously used startTime (which is actually the time the request was queued) was not correct for drivers with read-ahead. In the experimental version, [ns_conn start] now always returns the accept time.

The next paragraph uses the term endTime, which is the time when a connection thread is done with a request (either the content was delivered, or the content was handed over to a writer thread).

The difference between requestDequeueTime and requestQueueTime is the time spent in the queue. The difference between endTime and requestDequeueTime is the pure runtime; the difference between endTime and acceptTime is the totalTime. As a rough approximation, the time between requestDequeueTime and acceptTime is pretty much influenced by the server setup, and the runtime by the application. I used the term "approximation" since the runtime of certain other requests influences the queuing time, as we see in the following.

Consider a server with two running connection threads receiving 6 requests, where requests 2-5 are received within a very short time. The first three requests are directly assigned to connection threads and have fromQueue == 0. These have queuing times between 88 and 110 microseconds, which includes signal sending/receiving, thread change, and initial setup in the connection thread. The runtimes for these requests are pretty bad, in the range of 0.24 to 3.8 seconds elapsed time.

  [1] waiting 0 current 2 idle 1 ncons 999 fromQueue 0 accept 0.000000 queue 0.000110 run 0.637781 total 0.637891
  [2] waiting 3 current 2 idle 0 ncons 998 fromQueue 0 accept 0.000000 queue 0.000090 run 0.245030 total 0.245120
  [3] waiting 2 current 2 idle 0 ncons 987 fromQueue 0 accept 0.000000 queue 0.000088 run 0.432421 total 0.432509
  [4] waiting 1 current 2 idle 0 ncons 997 fromQueue 1 accept 0.000000 queue 0.244246 run 0.249208 total 0.493454
  [5] waiting 0 current 2 idle 0 ncons 986 fromQueue 1 accept 0.000000 queue 0.431545 run 3.713331 total 4.144876
  [6] waiting 0 current 2 idle 0 ncons 996 fromQueue 1 accept 0.000000 queue 0.480382 run 3.799818 total 4.280200

Requests [4, 5, 6] are queued and have queuing times between 0.2 and 0.5 seconds. The queuing times are pretty much the runtimes of [2, 3, 4]; therefore the runtime determines the queuing time. For example, the totalTime of request [4] was 0.493454 secs, half of which it spent waiting in the queue. Request [4] can consider itself happy that it was not scheduled after [5] or [6], where its totalTime would likely have been in the range of 4 secs (10 times slower). Low waiting times are essential for good performance.

This example shows pretty well the importance of async delivery mechanisms like the writer thread or bgdelivery in OpenACS. A file being delivered by the connection thread over a slow internet connection might block later requests substantially (as in the cases above). This is even more important for today's websites, where a single view might entail 60+ embedded requests for js, css, images, ..., where it is not feasible to define hundreds of connection threads.

Before going further into detail, I'll provide further introspection mechanisms in the experimental version:

- [ns_server stats] adding total waiting time
- [ns_conn queuetime] ... time spent in queue
- [ns_conn dequeue] ... time stamp when the request actually starts to run (similar to [ns_conn start])

The first can be used for server monitoring, the next two for single-connection introspection. The queuetime can be useful for better awareness and for optional output in the access log, and the dequeue time-stamp for application-level profiling as the base for a difference with the current time.

Further wishes, suggestions, comments?

-gustaf neumann
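As a sketch of how the proposed introspection could be used from the application layer once available: a trace filter that flags requests with long queue waits in the error log. [ns_conn queuetime] is the command proposed in this mail and is hypothetical until released; the 0.1 s threshold is illustrative:

    # Trace filter logging requests that waited long in the queue.
    # [ns_conn queuetime] follows the proposal above and may differ
    # in the released API; the threshold is illustrative.
    proc log_queuetime {args} {
        set qt [ns_conn queuetime]
        if {$qt > 0.1} {
            ns_log notice "slow queueing: [ns_conn url] waited ${qt}s"
        }
        return filter_ok
    }
    ns_register_filter trace GET /* log_queuetime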
From: Gustaf N. <ne...@wu...> - 2012-11-19 11:30:28
On 18.11.12 20:34, Stephen Deasey wrote:
> On Sun, Nov 18, 2012 at 1:22 PM, Gustaf Neumann <ne...@wu...> wrote:
>> Here are some actual figures with a comparable number of requests:
>>
>> with minthreads==maxthreads==2
>> requests 10182 queued 2695 connthreads 11 cpu 00:05:27 rss 415
>>
>> below are the previous values, completed by the number of queuing
>> operations and the rss size in MB
>>
>> with minthreads=2, create when queue >= 2
>> requests 10104 queued 1584 connthreads 27 cpu 00:06:14 rss 466
>>
>> as anticipated, thread creations and cpu consumption went down, but the
>> number of queued requests (requests that could not be executed
>> immediately) increased significantly.
>
> I was thinking of the opposite: make min/max threads equal by
> increasing min threads. Requests would never stall in the queue,
> unlike the experiment you ran with max threads reduced to min threads.

on the site, we have maxthreads 10, so setting minthreads to 10 as well
has the consequence of a larger memsize (and a substantially reduced
number of queued requests).

> But there's another benefit: unlike the dynamic scenario, requests
> would also never stall in the queue when a new thread had to be
> started when min < max threads.

you are talking about naviserver before 4.99.4. Both the version at the
tip of the naviserver repository and the forked version already provide
warmed-up threads. The version on the main tip starts to listen to the
wakeup signals only once it is warmed up; the version in the fork adds a
thread to the conn queue only after its startup is complete. So, in both
cases there is no stall. In earlier versions, this was as you describe.

> What is the down side to increasing min threads up to max threads?

higher memory consumption, maybe more open database connections,
allocating resources which are not needed. The degree of wastefulness
certainly depends on maxthreads. I would assume that for an admin
carefully watching the server needs, setting minthreads==maxthreads to
the "right value" can lead to slight improvements as long as the load is
rather constant over time.

>> Maybe the most significant benefit of a low maxthreads value is the
>> reduced memory consumption. On this machine we are using plain Tcl
>> with its "zippy malloc", which does not release memory (once allocated
>> to its pool) back to the OS. So, the measured memsize depends on the
>> max number of threads with tcl interps, especially with large
>> blueprints (as in the case of OpenACS).
>
> Right: the max number of threads *ever*, not just currently. So by
> killing threads you don't reduce memory usage, but you do increase
> latency for some requests which have to wait for a thread+interp to be
> created.

not really, given the warm-up feature.

> Is it convenient to measure latency distribution (not just average)? I
> guess not: we record conn.startTime when a connection is taken out of
> the queue and passed to a conn thread, but we don't record the time
> when a socket was accepted.

we could record the socket accept time and measure the difference until
the start of the connection runtime; when we output this to the
accesslog (like logreqtime), we could run whatever statistics we want.

> Actually, managing request latency is another area we don't handle so
> well. You can influence it by adjusting the OS listen socket accept
> queue length, you can adjust the length of the naviserver queue, and
> with the proposed change here you can change how aggressively new
> threads are created to process requests in the queue. But queue-depth
> is a roundabout way of specifying milliseconds of latency. And not
> just round-about but inherently imprecise, as different URLs are going
> to require different amounts of time to complete, and which URLs are
> requested is a function of current traffic. If instead of queue size
> you could specify a target latency then we could maybe do smarter
> things with the queue, such as pull requests off the back of the queue
> which have been waiting longer than the target latency, making room
> for fresh requests on the front of the queue.

The idea of controlling the number of running threads via queuing
latency is interesting, but i have to look into the details before i can
comment on this.

-gustaf
|
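For reference, the fixed-pool setup discussed above maps to two pool
parameters in the config file. A minimal sketch, assuming the standard
server section; the server name and values are placeholders:

    ns_section "ns/server/server1"
    ns_param minthreads 10   ;# == maxthreads: a fixed-size thread pool,
    ns_param maxthreads 10   ;# no dynamic thread creation or teardown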
From: Stephen D. <sd...@gm...> - 2012-11-18 19:35:19
|
On Sun, Nov 18, 2012 at 1:22 PM, Gustaf Neumann <ne...@wu...> wrote:
> On 14.11.12 09:51, Gustaf Neumann wrote:
>> On 13.11.12 15:02, Stephen Deasey wrote:
>>> On Tue, Nov 13, 2012 at 11:18 AM, Gustaf Neumann <ne...@wu...> wrote:
>>>> minthreads = 2
>>>>
>>>> creating threads, when idle == 0
>>>> 10468 requests, connthreads 267
>>>> total cputime 00:10:32
>>>>
>>>> creating threads, when queue >= 5
>>>> requests 10104 connthreads 27
>>>> total cputime 00:06:14
>>>
>>> What if you set minthreads == maxthreads?
>>
>> The number of thread create operations will go further down.
>
> Here are some actual figures with a comparable number of requests:
>
> with minthreads==maxthreads==2
> requests 10182 queued 2695 connthreads 11 cpu 00:05:27 rss 415
>
> below are the previous values, completed by the number of queuing
> operations and the rss size in MB
>
> with minthreads=2, create when queue >= 2
> requests 10104 queued 1584 connthreads 27 cpu 00:06:14 rss 466
>
> as anticipated, thread creations and cpu consumption went down, but the
> number of queued requests (requests that could not be executed
> immediately) increased significantly.

I was thinking of the opposite: make min/max threads equal by
increasing min threads. Requests would never stall in the queue, unlike
the experiment you ran with max threads reduced to min threads.

But there's another benefit: unlike the dynamic scenario, requests would
also never stall in the queue when a new thread had to be started when
min < max threads.

What is the down side to increasing min threads up to max threads?

> Maybe the most significant benefit of a low maxthreads value is the
> reduced memory consumption. On this machine we are using plain Tcl with
> its "zippy malloc", which does not release memory (once allocated to
> its pool) back to the OS. So, the measured memsize depends on the max
> number of threads with tcl interps, especially with large blueprints
> (as in the case of OpenACS).

Right: the max number of threads *ever*, not just currently. So by
killing threads you don't reduce memory usage, but you do increase
latency for some requests which have to wait for a thread+interp to be
created.

Is it convenient to measure latency distribution (not just average)? I
guess not: we record conn.startTime when a connection is taken out of
the queue and passed to a conn thread, but we don't record the time when
a socket was accepted.

Actually, managing request latency is another area we don't handle so
well. You can influence it by adjusting the OS listen socket accept
queue length, you can adjust the length of the naviserver queue, and
with the proposed change here you can change how aggressively new
threads are created to process requests in the queue. But queue-depth is
a roundabout way of specifying milliseconds of latency. And not just
round-about but inherently imprecise, as different URLs are going to
require different amounts of time to complete, and which URLs are
requested is a function of current traffic. If instead of queue size you
could specify a target latency then we could maybe do smarter things
with the queue, such as pull requests off the back of the queue which
have been waiting longer than the target latency, making room for fresh
requests on the front of the queue.
|
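To make the target-latency idea above concrete, here is a minimal sketch
in Tcl. This is illustrative pseudocode only, not NaviServer code; the
queue is assumed to be a Tcl list of dicts carrying a hypothetical
"accepted" timestamp in milliseconds:

    proc dequeue_fresh {queueVar targetMs now} {
        upvar 1 $queueVar queue
        while {[llength $queue] > 0} {
            set req [lindex $queue 0]
            set queue [lrange $queue 1 end]
            if {($now - [dict get $req accepted]) <= $targetMs} {
                return $req   ;# still within the latency target: dispatch it
            }
            # waited longer than the target: reject this request (e.g.
            # with a 503) instead of processing stale work, making room
            # for fresh requests at the front of the queue
        }
        return {}
    }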
From: Agustin L. <Agu...@uv...> - 2012-11-18 15:32:29
|
Hello all! I have been using Naviserver with our installation for a
couple of weeks. All OK. The memory consumption has been reduced from 8
to 5 GB on each node, and there are almost no crashes or reboots due to
memory problems any more. :-)

I am having some problems with WebDav. Should WebDav work equally with
both (Aolserver / Naviserver) with the same code?

Regards, Agustin
|
From: Gustaf N. <ne...@wu...> - 2012-11-18 13:22:12
|
On 14.11.12 09:51, Gustaf Neumann wrote:
> On 13.11.12 15:02, Stephen Deasey wrote:
>> On Tue, Nov 13, 2012 at 11:18 AM, Gustaf Neumann <ne...@wu...> wrote:
>>> minthreads = 2
>>>
>>> creating threads, when idle == 0
>>> 10468 requests, connthreads 267
>>> total cputime 00:10:32
>>>
>>> creating threads, when queue >= 5
>>> requests 10104 connthreads 27
>>> total cputime 00:06:14
>> What if you set minthreads == maxthreads?
> The number of thread create operations will go further down.

Here are some actual figures with a comparable number of requests:

with minthreads==maxthreads==2
requests 10182 queued 2695 connthreads 11 cpu 00:05:27 rss 415

below are the previous values, completed by the number of queuing
operations and the rss size in MB

with minthreads=2, create when queue >= 2
requests 10104 queued 1584 connthreads 27 cpu 00:06:14 rss 466

as anticipated, thread creations and cpu consumption went down, but the
number of queued requests (requests that could not be executed
immediately) increased significantly.

Maybe the most significant benefit of a low maxthreads value is the
reduced memory consumption. On this machine we are using plain Tcl with
its "zippy malloc", which does not release memory (once allocated to its
pool) back to the OS. So, the measured memsize depends on the max number
of threads with tcl interps, especially with large blueprints (as in the
case of OpenACS). This situation can be improved with e.g. jemalloc
(which we are using in production; it requires a modified tcl), but
after about 2 or 3 days of running a server the rss sizes are very
similar (most likely due to fragmentation).

-gustaf neumann

> When running already at minthreads, the connection thread timeout is
> ignored (otherwise there would be a high number of thread create
> operations just after the timeout expires, to ensure minthreads running
> connection threads). With connsperthread == 1000, there will be about
> 10 thread create operations for 10000 requests (not counting the 2
> initial create operations during startup for minthreads == 2). So, the
> cpu consumption will be lower, but the server would not scale when the
> request frequency requires more connection threads. Furthermore, most
> likely more requests will be put into the queue instead of being served
> immediately.
>
> When we assume that with minthreads == maxthreads == 2 there won't be
> more than, say, 20 requests queued, a similar effect could be achieved
> by allowing additional thread creations when more than 20 requests are
> in the queue. Or, even more conservatively, allowing thread creations
> only when the request queue is completely full (setting the low water
> mark to 100%) would still be better than minthreads == maxthreads,
> since the server will at least start to create additional threads in
> this rather hopeless situation, whereas with minthreads == maxthreads,
> it won't.
|
From: Gustaf N. <ne...@wu...> - 2012-11-14 08:52:04
|
On 13.11.12 15:02, Stephen Deasey wrote:
> On Tue, Nov 13, 2012 at 11:18 AM, Gustaf Neumann <ne...@wu...> wrote:
>> minthreads = 2
>>
>> creating threads, when idle == 0
>> 10468 requests, connthreads 267
>> total cputime 00:10:32
>>
>> creating threads, when queue >= 5
>> requests 10104 connthreads 27
>> total cputime 00:06:14
> What if you set minthreads == maxthreads?

The number of thread create operations will go further down. When
running already at minthreads, the connection thread timeout is ignored
(otherwise there would be a high number of thread create operations just
after the timeout expires, to ensure minthreads running connection
threads). With connsperthread == 1000, there will be about 10 thread
create operations for 10000 requests (not counting the 2 initial create
operations during startup for minthreads == 2). So, the cpu consumption
will be lower, but the server would not scale when the request frequency
requires more connection threads. Furthermore, most likely more requests
will be put into the queue instead of being served immediately.

When we assume that with minthreads == maxthreads == 2 there won't be
more than, say, 20 requests queued, a similar effect could be achieved
by allowing additional thread creations when more than 20 requests are
in the queue. Or, even more conservatively, allowing thread creations
only when the request queue is completely full (setting the low water
mark to 100%) would still be better than minthreads == maxthreads, since
the server will at least start to create additional threads in this
rather hopeless situation, whereas with minthreads == maxthreads, it
won't.
|
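The low-water-mark policy described here boils down to a simple decision
rule. A sketch in Tcl, for illustration only (the real logic lives in C
inside the server; all names here are made up, and queuesize is assumed
to be positive):

    proc want_new_thread {queued queuesize lowwaterPercent running maxthreads} {
        # never exceed maxthreads; otherwise create a thread only when
        # the waiting queue is filled beyond the low water mark (percent)
        if {$running >= $maxthreads} {
            return 0
        }
        return [expr {100.0 * $queued / $queuesize >= $lowwaterPercent}]
    }

Setting lowwaterPercent to 100 gives the conservative "only when the
queue is completely full" behavior; setting it to 0 gives the aggressive
"whenever something is queued" behavior.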
From: Stephen D. <sd...@gm...> - 2012-11-13 14:03:10
|
On Tue, Nov 13, 2012 at 11:18 AM, Gustaf Neumann <ne...@wu...> wrote:
>
> minthreads = 2
>
> creating threads, when idle == 0
> 10468 requests, connthreads 267
> total cputime 00:10:32
>
> creating threads, when queue >= 5
> requests 10104 connthreads 27
> total cputime 00:06:14

What if you set minthreads == maxthreads?
|
From: Gustaf N. <ne...@wu...> - 2012-11-13 11:18:47
|
Dear all, again some update: the mechanism sketched below now works in
the regression test as well. There is now a backchannel in place that
lets conn threads notify the driver to check the liveliness of the
server. This backchannel also makes the timeout-based liveliness
checking obsolete. By using the lowwatermark parameter to control thread
creation, the resource consumption went down significantly without
sacrificing speed for this setup.

Here is some data from next-scripting.org, which is a rather idle site
with real world traffic (including bots etc.). The server has defined
minthreads = 2, is running 2 drivers (nssock + nsssl) and uses a writer
thread.

before (creating threads, when idle == 0, running server for 2 days)
10468 requests, connthreads 267
total cputime 00:10:32

new (creating threads, when queue >= 5, running server for 2 days)
requests 10104 connthreads 27
total cputime 00:06:14

One can see that the number of create operations for connection threads
went down by a factor of 10 (from 267 to 27), and that the cpu
consumption was reduced by about 40% (thread initialization costs 0.64
secs in this configuration). One can get a behavior similar to idle==0
by setting the low water mark to 0.

The shutdown mechanism is now adjusted to the new infrastructure
(connection threads have their own condition variable, so one cannot use
the old broadcast to all conn threads anymore).

-gustaf neumann

On 07.11.12 02:54, Gustaf Neumann wrote:
> Some update: after some more testing with the new code, i still think
> the version is promising, but it needs a few tweaks. I have started to
> address the thread creation.
>
> To sum up the thread creation behavior/configuration of naviserver-tip:
>
> - minthreads (try to keep at least minthreads threads idle)
> - spread (fight against thread mass extinction due to round robin)
> - threadtimeout (useless due to round robin)
> - connsperthread (the only parameter effectively controlling the
>   lifespan of a conn thread)
> - maxconnections (controls maximum number of connections in the
>   waiting queue, including running threads)
> - concurrentcreatethreshold (percentage of the waiting queue full at
>   which threads are created concurrently)
>
> Due to the policy of keeping at least minthreads idle, threads are
> preallocated when the load is high; the number of threads never falls
> under minthreads by construct. Threads stop mostly due to
> connsperthread.
>
> Naviserver with thread queue (fork):
>
> - minthreads (try to keep at least minthreads threads idle)
> - threadtimeout (works effectively, default 120 secs)
> - connsperthread (as before, just not varied via spread)
> - maxconnections (as before; use maybe "queuesize" instead)
> - lowwatermark (new)
> - highwatermark (was concurrentcreatethreshold)
>
> The parameter "spread" is already deleted, since the enqueueing takes
> care of a certain distribution, at least when several threads are
> created. Threads are often deleted before connsperthread due to the
> timeout. Experiments show furthermore that the rather aggressive
> preallocation policy with minthreads idle threads now causes many more
> thread destroy and thread create operations than before. With OpenACS,
> thread creation is compute-intensive (about 1 sec).
>
> In the experimental version, connections are only queued when no
> connection thread is available (the tip version places every
> connection into the queue). Queueing happens with "bulky" requests,
> when e.g. a view causes a bunch (on average 5, often 10+, sometimes
> 50+) of requests for embedded resources (style files, javascript,
> images). It seems that permitting a few queued requests is often a
> good idea, since the connection threads will typically pick these up
> very quickly.
>
> To make the aggressiveness of the thread creation policy better
> configurable, the experimental version uses for this purpose solely
> the number of queued requests, based on two parameters:
>
> - lowwatermark (if the actual queue size is below this value, don't
>   try to create threads; default 5%)
> - highwatermark (if the actual queue size is above this value, allow
>   parallel thread creates; default 80%)
>
> To increase the aggressiveness, one could set lowwatermark to e.g. 0,
> causing thread creates whenever a connection is queued. Increasing the
> lowwatermark reduces the willingness to create new threads. The
> highwatermark might be useful for benchmark situations, where the
> queue fills up quickly.
>
> The default values seem to work quite well; they are used currently on
> http://next-scripting.org. However, we still need some more
> experiments on different sites to get a better understanding.
>
> hmm, final comment: for the regression test, i had to add the policy
> of creating threads when all connection threads are busy. The config
> file of the regression test uses connsperthread 0 (which is the
> default, but not very good as such), causing every connection thread
> to exit after every request. So, when the request comes in, we have a
> thread busy, but nothing queued, so there would not be the need to
> create a new thread. However, when the conn thread exits, the single
> request would not be processed.
>
> So, much more testing is needed.
> -gustaf neumann
>
> On 01.11.12 20:17, Gustaf Neumann wrote:
>> Dear all,
>>
>> There is now a version on bitbucket, which works quite nice and
>> stable, as far as i can tell. I have split up the rather coarse lock
>> of all pools and introduced finer locks for the waiting queue
>> (wqueue) and thread queue (tqueue) per pool. The changes lead to
>> significantly finer lock granularity and improve scalability.
>>
>> I have tested this new version with a synthetic load of 120 requests
>> per second, some slower requests and some faster ones, and it appears
>> to be pretty stable. This load keeps about 20 connection threads
>> quite busy on my home machine. The contention of the new locks is
>> very little: in this test we saw 12 busy locks out of 217.000 locks
>> on the waiting queue, and 9 busy locks out of 83.000 locks on the
>> thread queue. These measures are much better than in current
>> naviserver, which has in the same test 248.000 locks on the queue
>> with 190 busy ones. The total waiting time for locks is reduced by a
>> factor of 10. One has to add that it was not so bad before either.
>> The benefit will be larger when multiple pools are used.
>>
>> Finally i think the code is clearer than before, where the lock
>> duration was quite tricky to determine.
>>
>> opinions?
>> -gustaf neumann
>>
>> PS: For the changes, see:
>> https://bitbucket.org/gustafn/naviserver-connthreadqueue/changesets
>>
>> PS2: have not addressed the server exit signaling yet.
>>
>> On 29.10.12 13:41, Gustaf Neumann wrote:
>>> A version of this is in the following fork:
>>>
>>> https://bitbucket.org/gustafn/naviserver-connthreadqueue/changesets
>>>
>>> So far, the competition on the pool mutex is quite high, but i
>>> think it can be improved. Currently, primarily the pool mutex is
>>> used for conn thread life-cycle management, and it is needed by the
>>> main/drivers/spoolers as well as by the connection threads to
>>> update the idle/running/... counters needed for controlling thread
>>> creation etc. Differentiating these mutexes should help.
>>>
>>> i have not addressed the termination signaling, but that's rather
>>> simple.
>>>
>>> -gustaf neumann
>>>
>>> On 28.10.12 03:08, Gustaf Neumann wrote:
>>>> i've just implemented a lightweight version of the above (just a
>>>> few lines of code) by extending the connThread Arg structure; ....
|
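For reference, the parameters discussed in this thread would be set
roughly as follows. This is a sketch following the parameter names used
in the fork; the server name and the concrete values are placeholders:

    ns_section "ns/server/server1"
    ns_param minthreads     2
    ns_param maxthreads     10
    ns_param connsperthread 1000
    ns_param threadtimeout  120   ;# seconds
    ns_param lowwatermark   5     ;# queue fill (%) below which no thread is created
    ns_param highwatermark  80    ;# queue fill (%) above which threads are created concurrently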
From: Gustaf N. <ne...@wu...> - 2012-11-07 01:55:44
|
Some update: after some more testing with the new code, i still think
the version is promising, but it needs a few tweaks. I have started to
address the thread creation.

To sum up the thread creation behavior/configuration of naviserver-tip:

- minthreads (try to keep at least minthreads threads idle)
- spread (fight against thread mass extinction due to round robin)
- threadtimeout (useless due to round robin)
- connsperthread (the only parameter effectively controlling the
  lifespan of a conn thread)
- maxconnections (controls maximum number of connections in the waiting
  queue, including running threads)
- concurrentcreatethreshold (percentage of the waiting queue full at
  which threads are created concurrently)

Due to the policy of keeping at least minthreads idle, threads are
preallocated when the load is high; the number of threads never falls
under minthreads by construct. Threads stop mostly due to
connsperthread.

Naviserver with thread queue (fork):

- minthreads (try to keep at least minthreads threads idle)
- threadtimeout (works effectively, default 120 secs)
- connsperthread (as before, just not varied via spread)
- maxconnections (as before; use maybe "queuesize" instead)
- lowwatermark (new)
- highwatermark (was concurrentcreatethreshold)

The parameter "spread" is already deleted, since the enqueueing takes
care of a certain distribution, at least when several threads are
created. Threads are often deleted before connsperthread due to the
timeout. Experiments show furthermore that the rather aggressive
preallocation policy with minthreads idle threads now causes many more
thread destroy and thread create operations than before. With OpenACS,
thread creation is compute-intensive (about 1 sec).

In the experimental version, connections are only queued when no
connection thread is available (the tip version places every connection
into the queue). Queueing happens with "bulky" requests, when e.g. a
view causes a bunch (on average 5, often 10+, sometimes 50+) of requests
for embedded resources (style files, javascript, images). It seems that
permitting a few queued requests is often a good idea, since the
connection threads will typically pick these up very quickly.

To make the aggressiveness of the thread creation policy better
configurable, the experimental version uses for this purpose solely the
number of queued requests, based on two parameters:

- lowwatermark (if the actual queue size is below this value, don't try
  to create threads; default 5%)
- highwatermark (if the actual queue size is above this value, allow
  parallel thread creates; default 80%)

To increase the aggressiveness, one could set lowwatermark to e.g. 0,
causing thread creates whenever a connection is queued. Increasing the
lowwatermark reduces the willingness to create new threads. The
highwatermark might be useful for benchmark situations, where the queue
fills up quickly.

The default values seem to work quite well; they are used currently on
http://next-scripting.org. However, we still need some more experiments
on different sites to get a better understanding.

hmm, final comment: for the regression test, i had to add the policy of
creating threads when all connection threads are busy. The config file
of the regression test uses connsperthread 0 (which is the default, but
not very good as such), causing every connection thread to exit after
every request. So, when the request comes in, we have a thread busy, but
nothing queued, so there would not be the need to create a new thread.
However, when the conn thread exits, the single request would not be
processed.

So, much more testing is needed.
-gustaf neumann

On 01.11.12 20:17, Gustaf Neumann wrote:
> Dear all,
>
> There is now a version on bitbucket, which works quite nice and
> stable, as far as i can tell. I have split up the rather coarse lock
> of all pools and introduced finer locks for the waiting queue (wqueue)
> and thread queue (tqueue) per pool. The changes lead to significantly
> finer lock granularity and improve scalability.
>
> I have tested this new version with a synthetic load of 120 requests
> per second, some slower requests and some faster ones, and it appears
> to be pretty stable. This load keeps about 20 connection threads quite
> busy on my home machine. The contention of the new locks is very
> little: in this test we saw 12 busy locks out of 217.000 locks on the
> waiting queue, and 9 busy locks out of 83.000 locks on the thread
> queue. These measures are much better than in current naviserver,
> which has in the same test 248.000 locks on the queue with 190 busy
> ones. The total waiting time for locks is reduced by a factor of 10.
> One has to add that it was not so bad before either. The benefit will
> be larger when multiple pools are used.
>
> Finally i think the code is clearer than before, where the lock
> duration was quite tricky to determine.
>
> opinions?
> -gustaf neumann
>
> PS: For the changes, see:
> https://bitbucket.org/gustafn/naviserver-connthreadqueue/changesets
>
> PS2: have not addressed the server exit signaling yet.
>
> On 29.10.12 13:41, Gustaf Neumann wrote:
>> A version of this is in the following fork:
>>
>> https://bitbucket.org/gustafn/naviserver-connthreadqueue/changesets
>>
>> So far, the competition on the pool mutex is quite high, but i think
>> it can be improved. Currently, primarily the pool mutex is used for
>> conn thread life-cycle management, and it is needed by the
>> main/drivers/spoolers as well as by the connection threads to update
>> the idle/running/... counters needed for controlling thread creation
>> etc. Differentiating these mutexes should help.
>>
>> i have not addressed the termination signaling, but that's rather
>> simple.
>>
>> -gustaf neumann
>>
>> On 28.10.12 03:08, Gustaf Neumann wrote:
>>> i've just implemented a lightweight version of the above (just a
>>> few lines of code) by extending the connThread Arg structure; ....

--
Univ.Prof. Dr. Gustaf Neumann
Institute of Information Systems and New Media
WU Vienna
Augasse 2-6, A-1090 Vienna, AUSTRIA
|
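As a side note on the lock statistics quoted above: such figures can be
inspected from a running server at the Tcl level. A hedged sketch,
assuming that "ns_info locks" returns one record per mutex whose fields
include the name plus total and busy lock counters (check the exact
field order on your build):

    # report only mutexes that have seen contention
    foreach entry [ns_info locks] {
        lassign $entry name owner id nlock nbusy
        if {$nbusy > 0} {
            ns_log notice "contended mutex $name: $nbusy busy out of $nlock locks"
        }
    }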
From: Jeff R. <dv...@di...> - 2012-11-07 01:25:46
|
Ok, this must be something weird about my build. I'll ignore it for now.
Thanks

-J

Gustaf Neumann wrote:
> Jeff, are you sure you are setting the parameter correctly in your
> config file? It seems to work for me (Linux FC 17).
>
> clone(child_stack=0x7f912f2b6fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f912f2b79d0, tls=0x7f912f2b7700, child_tidptr=0x7f912f2b79d0) = 11862
> mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f9116212000
> mprotect(0x7f9116212000, 4096, PROT_NONE) = 0
> clone(child_stack=0x7f9116231fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f91162329d0, tls=0x7f9116232700, child_tidptr=0x7f91162329d0) = 11863
> mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f9115f94000
> mprotect(0x7f9115f94000, 4096, PROT_NONE) = 0
> clone(child_stack=0x7f9115fb3fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f9115fb49d0, tls=0x7f9115fb4700, child_tidptr=0x7f9115fb49d0) = 11864
>
> -gustaf
>
> On 05.11.12 22:56, Jeff Rogers wrote:
>> Hi all,
>>
>> I've been trying to track down why my naviserver processes have such
>> a large memory footprint compared to similarly configured aolserver
>> processes. One culprit I already found is the pre-allocation of
>> per-conn compression streams (the fixes to this have already been
>> mentioned, tho not yet committed).
>>
>> The other culprit is strange. It appears that naviserver is
>> completely ignoring the stacksize parameter, and starting all threads
>> with the default size, which on linux is the maximum stack size at
>> process start time, 8Mb by default. This isn't a case of the wrong
>> parameter being set; it appears to be a behavior of the nsthread
>> library. nsthreadtest shows the same behavior, even after it calls
>>
>> Ns_ThreadStackSize(81920);
>>
>> I've added a debug printf to pthread.c to double-check what is being
>> passed as the stacksize parameter, and it's the expected value in all
>> cases; it's just not used for some reason. I've been using strace to
>> peek at what's going on, and nsthreadtest in naviserver looks like
>>
>> mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xb4bfa000
>> mprotect(0xb4bfa000, 4096, PROT_NONE) = 0
>> clone(child_stack=0xb53fa494, ...
>>
>> while in aolserver it's
>>
>> mmap2(NULL, 90112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xb748b000
>> mprotect(0xb748b000, 4096, PROT_NONE) = 0
>> clone(child_stack=0xb74a0494, ...
>>
>> As you can see it's an ~8M vs 90k (81920 + guard pages) allocation.
>>
>> Does anyone else on linux see the same behavior, or should I be
>> looking at local differences?
>>
>> Second, the naviserver and aolserver nsthread libraries aren't
>> identical, but they aren't very different. Any ideas what would be
>> causing this behavior?
>>
>> -J
|
From: Gustaf N. <ne...@wu...> - 2012-11-06 20:05:03
|
Jeff, are you sure you are setting the parameter correctly in your
config file? It seems to work for me (Linux FC 17).

clone(child_stack=0x7f912f2b6fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f912f2b79d0, tls=0x7f912f2b7700, child_tidptr=0x7f912f2b79d0) = 11862
mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f9116212000
mprotect(0x7f9116212000, 4096, PROT_NONE) = 0
clone(child_stack=0x7f9116231fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f91162329d0, tls=0x7f9116232700, child_tidptr=0x7f91162329d0) = 11863
mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f9115f94000
mprotect(0x7f9115f94000, 4096, PROT_NONE) = 0
clone(child_stack=0x7f9115fb3fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f9115fb49d0, tls=0x7f9115fb4700, child_tidptr=0x7f9115fb49d0) = 11864

-gustaf

On 05.11.12 22:56, Jeff Rogers wrote:
> Hi all,
>
> I've been trying to track down why my naviserver processes have such a
> large memory footprint compared to similarly configured aolserver
> processes. One culprit I already found is the pre-allocation of
> per-conn compression streams (the fixes to this have already been
> mentioned, tho not yet committed).
>
> The other culprit is strange. It appears that naviserver is completely
> ignoring the stacksize parameter, and starting all threads with the
> default size, which on linux is the maximum stack size at process
> start time, 8Mb by default. This isn't a case of the wrong parameter
> being set; it appears to be a behavior of the nsthread library.
> nsthreadtest shows the same behavior, even after it calls
>
> Ns_ThreadStackSize(81920);
>
> I've added a debug printf to pthread.c to double-check what is being
> passed as the stacksize parameter, and it's the expected value in all
> cases; it's just not used for some reason. I've been using strace to
> peek at what's going on, and nsthreadtest in naviserver looks like
>
> mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xb4bfa000
> mprotect(0xb4bfa000, 4096, PROT_NONE) = 0
> clone(child_stack=0xb53fa494, ...
>
> while in aolserver it's
>
> mmap2(NULL, 90112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xb748b000
> mprotect(0xb748b000, 4096, PROT_NONE) = 0
> clone(child_stack=0xb74a0494, ...
>
> As you can see it's an ~8M vs 90k (81920 + guard pages) allocation.
>
> Does anyone else on linux see the same behavior, or should I be
> looking at local differences?
>
> Second, the naviserver and aolserver nsthread libraries aren't
> identical, but they aren't very different. Any ideas what would be
> causing this behavior?
>
> -J

--
Univ.Prof. Dr. Gustaf Neumann
Institute of Information Systems and New Media
WU Vienna
Augasse 2-6, A-1090 Vienna, AUSTRIA
|
From: Jeff R. <dv...@di...> - 2012-11-05 21:56:45
|
Hi all,

I've been trying to track down why my naviserver processes have such a
large memory footprint compared to similarly configured aolserver
processes. One culprit I already found is the pre-allocation of per-conn
compression streams (the fixes to this have already been mentioned, tho
not yet committed).

The other culprit is strange. It appears that naviserver is completely
ignoring the stacksize parameter, and starting all threads with the
default size, which on linux is the maximum stack size at process start
time, 8Mb by default. This isn't a case of the wrong parameter being
set; it appears to be a behavior of the nsthread library. nsthreadtest
shows the same behavior, even after it calls

Ns_ThreadStackSize(81920);

I've added a debug printf to pthread.c to double-check what is being
passed as the stacksize parameter, and it's the expected value in all
cases; it's just not used for some reason. I've been using strace to
peek at what's going on, and nsthreadtest in naviserver looks like

mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xb4bfa000
mprotect(0xb4bfa000, 4096, PROT_NONE) = 0
clone(child_stack=0xb53fa494, ...

while in aolserver it's

mmap2(NULL, 90112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xb748b000
mprotect(0xb748b000, 4096, PROT_NONE) = 0
clone(child_stack=0xb74a0494, ...

As you can see it's an ~8M vs 90k (81920 + guard pages) allocation.

Does anyone else on linux see the same behavior, or should I be looking
at local differences?

Second, the naviserver and aolserver nsthread libraries aren't
identical, but they aren't very different. Any ideas what would be
causing this behavior?

-J
|
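For anyone reproducing this: the parameter in question normally lives in
the threads section of the config file. A minimal sketch, assuming the
usual section and parameter names; the value is just an example:

    ns_section "ns/threads"
    ns_param stacksize [expr {128 * 1024}]   ;# requested per-thread stack, in bytes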
From: Gustaf N. <ne...@wu...> - 2012-11-02 12:11:19
|
On 02.11.12 12:43, Stephen Deasey wrote:
> On Thu, Nov 1, 2012 at 12:13 PM, Gustaf Neumann <ne...@wu...> wrote:
>> i think, that adding a command reference would be a nice addition,
>> and is most likely possible with a reasonable amount of work. The
>> other alternative would be to split up the documentation pages with
>> multiple commands into single pages...
>
> The idea was to head in the other direction: consolidate similar
> commands into one page.

i fully agree that a contextual display of related commands is better.

> For example for the ns_return page there's a couple of paragraphs of
> explanation at the top that applies to all the commands, and by
> collecting them together it's easy to see how they're similar and how
> they differ. (maybe the 3 redirect commands should be split into their
> own page?)
>
> On the other hand, sometimes you just want a big list of all
> commands... Hopefully dtplite can be coaxed into generating it.

dtplite is a tcl script that packages common behavior of the tcl
doctools packages. One can certainly try to extend dtplite, or we can
roll our own doc generator using these packages, but this requires some
deeper diving into the doctools universe. Recent tcllib releases
contain, besides the classic doctools packages, the doctools v2
packages, which does not make the best path more obvious, since dtplite
does not use v2 (to my shallow understanding).

-gustaf neumann
|
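For context, rendering a single doctools manpage with dtplite looks
roughly like this (the file name is just an example; check the dtplite
documentation for the exact options):

    dtplite -o . html ns_return.man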
From: Stephen D. <sd...@gm...> - 2012-11-02 11:44:11
|
On Thu, Nov 1, 2012 at 12:13 PM, Gustaf Neumann <ne...@wu...> wrote:
>
> i think, that adding a command reference would be a nice addition, and
> is most likely possible with a reasonable amount of work. The other
> alternative would be to split up the documentation pages with multiple
> commands into single pages...

The idea was to head in the other direction: consolidate similar
commands into one page.

For example for the ns_return page there's a couple of paragraphs of
explanation at the top that applies to all the commands, and by
collecting them together it's easy to see how they're similar and how
they differ. (maybe the 3 redirect commands should be split into their
own page?)

On the other hand, sometimes you just want a big list of all commands...
Hopefully dtplite can be coaxed into generating it.
|
From: Stephen D. <sd...@gm...> - 2012-11-02 11:32:53
|
On Wed, Oct 31, 2012 at 12:54 PM, Stefan Sobernig <ste...@wu...> wrote:
>
> I'd also appreciate a heads-up on a previous thread:
>
> http://www.mail-archive.com/nav...@li.../msg01811.html
>
> Was the idea of a "default" method ("*" or the like) for the
> ns_register_* cmd family ever implemented?

No. Looks like some surgery to nsd/urlspace.c might be required to alter
the current order of lookup for the url "GET /foo/bar" from (method,
path, path) to (path, path, method), so that if the path matches and the
method does not, a default method handler can be returned.
|
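Until such a default method exists, the effect can be approximated at
the Tcl level by registering the same handler for each method
explicitly. A minimal sketch; the handler name and URL are placeholders:

    # register one handler per method today; a hypothetical "*" default
    # in the ns_register_* family would replace this loop
    foreach method {GET POST PUT DELETE HEAD} {
        ns_register_proc $method /foo/bar ::my::default_handler
    }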
From: Stephen D. <sd...@gm...> - 2012-11-02 10:44:18
|
On Thu, Nov 1, 2012 at 4:03 PM, David Osborne <da...@qc...> wrote:
>
> Trying to work out if this is a config error or a code difference...
> any pointers?

In general, add this to your config:

ns_section "ns/parameters"
ns_param logdev true

And the log file will tell you exactly which config parameters are being
accessed, with which values and types, what the defaults are, etc. It's
worthwhile doing this just because config files tend to accumulate cruft
after a while.
|
From: Gustaf N. <ne...@wu...> - 2012-11-02 10:04:12
|
Hi David,

i think, this is intentional. All config options for serving static
files are under

ns/server/${servername}/fastpath

see:
https://bitbucket.org/naviserver/naviserver/src/d34fc2681cf1fb4dec89b7172d09a187c37efe32/sample-config.tcl.in?at=default#cl-301

-gustaf

On 02.11.12 10:22, David Osborne wrote:
> Thanks Jeff and Gustaf.
>
> I found that setting pagedir had no effect under the server section
> (as is in the newly changed tests/http-test-config.tcl file), e.g.:
>
> ns_section "ns/server/${server_name}"
> ns_param pagedir $pageroot
>
> But if I added a subsequent fastpath section and added it there
> instead, then it worked. Is this what would be expected?
>
> ns_section "ns/server/${server_name}/fastpath"
> ns_param pagedir $pageroot
>
> On 1 November 2012 19:51, Gustaf Neumann <ne...@wu...
> <mailto:ne...@wu...>> wrote:
>
>> hmm, i found two occurrences of "ns_param pageroot", and changed
>> these.
>>
>> % egrep -R "ns_param *pageroot" *
>> openacs-config.tcl: ns_param pageroot $pageroot
>> tests/http-test-config.tcl: ns_param pageroot $pageroot
>>
>> i have added as well "ns_info pagedir" as an alias for "ns_info
>> pageroot". One should mark the latter as deprecated.
>>
>> -gustaf
|
From: David O. <da...@qc...> - 2012-11-02 09:49:37
|
Thanks Jeff and Gustaf.

I found that setting pagedir had no effect under the server section (as
is in the newly changed tests/http-test-config.tcl file), e.g.:

ns_section "ns/server/${server_name}"
ns_param pagedir $pageroot

But if I added a subsequent fastpath section and added it there instead,
then it worked. Is this what would be expected?

ns_section "ns/server/${server_name}/fastpath"
ns_param pagedir $pageroot

On 1 November 2012 19:51, Gustaf Neumann <ne...@wu...> wrote:
>
> hmm, i found two occurrences of "ns_param pageroot", and changed
> these.
>
> % egrep -R "ns_param *pageroot" *
> openacs-config.tcl: ns_param pageroot $pageroot
> tests/http-test-config.tcl: ns_param pageroot $pageroot
>
> i have added as well "ns_info pagedir" as an alias for "ns_info
> pageroot". One should mark the latter as deprecated.
>
> -gustaf