Yes, the system had been up for about 30 days, and there were some graceful restarts in between (due to configuration changes).
MaxConnectionsPerChild is set to 1000000.
Hi
We have a problem with an httpd proxy server (httpd 2.4.25 with mod_qos 11.39): under heavy load it blocks all incoming connections as soon as the QS_ClientPrefer limit is reached. It seems the counter of "concurrent connections" is not decremented correctly.
When the first mod_qos(066) event was triggered, the number of concurrent connections was 959. From that point on, the counter only went up (to as much as 13431), although MaxRequestWorkers is 1024.
Regards, Armin
Yes, Apache 2.4 (especially when using the event MPM) sometimes waits a few seconds before calling the connection cleanup method (freeing the allocated memory), which decrements the counter. This may be why you see a higher number of connections when many connections are opened and closed in a very short period of time.
Last edit: Pascal Buchbinder 2018-02-05
Hi Pascal, thanks for your reply!
Ok, but on this system the "concurrent connections" counter never went down again, so all further connections were blocked. As you can see in the log excerpt, the time span was over 10 hours.
The only place in the code that decrements the counter is the one in qos_cleanup_conn()
I could imagine something went wrong with the generation counter m_generation. What do you think?
BTW, the proxy-server uses worker MPM.
The generation id is set when starting or restarting (usr1) the server, and each forked worker process uses the id to verify that the shared memory belongs to the same version (a new shared memory structure is created at usr1 before forking new child processes). The cause might be that this check is only present when decrementing the counter but not when incrementing (I would be very surprised if a dying process still got new connections, but maybe this changed at some point in an Apache 2.4 version). The other possibility is process crashes, but I assume you have already checked the log to make sure this is not the case, haven't you?
I'm going to write some test cases (particularly for Apache 2.4) and will introduce the generation check at counter increment, just to be sure.
This could be a patch worth trying, to see whether it fixes the problem:
Ok, with that patch the counter might not be increased anymore. But the problem that it isn't counted down might still exist, or how do you see it?
If we don't increment, we don't have to decrement.
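To make the "if we don't increment, we don't have to decrement" argument concrete, here is a minimal sketch of a generation check applied symmetrically at both ends of a connection's life. It is not the mod_qos code or the patch discussed here: the types shared_counter_t and conn_state_t, the functions conn_open and conn_cleanup, and the per-connection "counted" flag are all hypothetical names chosen for illustration, and the locking a real shared-memory counter needs is left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the shared memory segment (not mod_qos's layout). */
typedef struct {
    int generation;   /* id assigned when this segment was created */
    int connections;  /* the "concurrent connections" counter */
} shared_counter_t;

/* Hypothetical per-connection state kept by a worker process. */
typedef struct {
    int  generation;  /* generation the owning process was forked with */
    bool counted;     /* true only if this connection incremented the counter */
} conn_state_t;

/* Increment only when the process generation matches the segment's. */
void conn_open(shared_counter_t *shm, conn_state_t *c, int process_generation)
{
    c->generation = process_generation;
    c->counted = (process_generation == shm->generation);
    if (c->counted) {
        shm->connections++;
    }
}

/* Decrement only what was counted, and only within the same generation. */
void conn_cleanup(shared_counter_t *shm, conn_state_t *c)
{
    if (c->counted && c->generation == shm->generation && shm->connections > 0) {
        shm->connections--;
    }
    c->counted = false;
}
```

With the same check on both paths, a connection handled by a process from an older generation never touches the counter at all, so it cannot leave a stale increment behind.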
Questions:
1. you don't see child processes terminating unexpectedly (child pid ... exit signal)?
2. are you doing graceful restarts when this happens (kill -usr1)?
3. what's your MaxConnectionsPerChild setting?
Last edit: Pascal Buchbinder 2018-02-06
btw, I did some testing with graceful restarts and forced child exits but couldn't reproduce the misbehavior. I think massive load plays an important role here...
Last edit: Armin Abfalterer 2018-02-08
I suggest you try applying the patch shown above (I've tested and committed the change but have not yet built a new release; I could not reproduce the issue either, but I assume the change does not make anything worse):
https://sourceforge.net/p/mod-qos/source/2367/tree//trunk/httpd_src/modules/qos/mod_qos.c?diff=2359
I also suggest running the server in QS_LogOnly mode until you are sure the problem is solved by this change.
Last edit: Pascal Buchbinder 2018-02-08
As of Apache 2.4.18 (Nov/Dec 2015), the concept of a master connection (representing a "real" TCP connection) having multiple slave connections was introduced: https://github.com/apache/httpd/commit/8d9cdb307a76c3f854fa144df3e6dfce3787937e#diff-1981f07af9efc86b50f10c114a0a9f78
I believe this causes the problem.
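As a rough illustration of why master and slave connections can skew a per-connection counter, the sketch below counts only connections that have no master, i.e. the "real" TCP connections. It is an assumption-laden model, not the change made in mod_qos: conn_t is a hypothetical, reduced record whose master pointer merely stands in for the concept described in the linked commit, and is_primary, on_connect and on_cleanup are invented names.

```c
#include <stddef.h>

/* Hypothetical, reduced connection record; only the master link matters here. */
typedef struct conn {
    struct conn *master;  /* NULL for a real TCP connection, else its owner */
    int counted;          /* set when this connection incremented the counter */
} conn_t;

/* A connection without a master represents a real TCP connection. */
int is_primary(const conn_t *c)
{
    return c->master == NULL;
}

/* Count primary connections only; slave connections are ignored. */
void on_connect(int *counter, conn_t *c)
{
    c->counted = is_primary(c);
    if (c->counted) {
        (*counter)++;
    }
}

/* Undo exactly what on_connect did for this connection. */
void on_cleanup(int *counter, conn_t *c)
{
    if (c->counted && *counter > 0) {
        (*counter)--;
    }
    c->counted = 0;
}
```

If slave connections were counted on one path but not the other, the counter would drift with every multiplexed request, which would be consistent with a counter that no longer returns to zero.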
@Armin: can you confirm that you are using HTTP/2?
Last edit: Pascal Buchbinder 2018-02-09
Pascal, yes - the proxy uses mod_http2.
I've now released mod_qos 11.49 with the intention of improving the h2 compatibility.
Last edit: Pascal Buchbinder 2018-02-10
I did some tests with HTTP/2 and could see that the "connection counter" didn't go back to zero when there were no more connections on the proxy. However, I still couldn't reproduce that the counter is constantly growing, even with HTTP/2.
Anyway, thanks for the improvements... I'll give 11.49 a try.