We have set up an xCAT hierarchical cluster and have noticed an issue where the SSL listener process dies. Specifically, it gets "cannot fork" errors and then exits. We can reproduce this by issuing many concurrent requests, for example:
for i in `seq 1 40`; do ./getpostscript.awk > /dev/null & done
This reproduces the error fairly reliably. We see a similar request storm on our production system when we boot large numbers of nodes. Once the process has died, restarting the xcatd service fails again quickly, because all of the nodes are now retrying their requests. This does not seem to happen on the management node, but it seems to handle the requests one at a time and is very slow. We have modified the xcatd script so that it does not immediately call 'die' when a fork fails, and the process then continues handling other requests. We have checked various ulimits and socket counts and cannot find any resource that is exhausted.
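To put a number on the reproduction above, something like the following could be used. It is only a sketch: run_parallel is a helper name made up for this example, and it assumes the request command exits non-zero when the request fails, which we have not verified for getpostscript.awk.

```shell
#!/bin/sh
# run_parallel N CMD [ARGS...]
# Hypothetical helper: start N copies of CMD in the background,
# wait for each one, and print how many exited non-zero.
run_parallel() {
    n=$1
    shift
    pids=""
    i=0
    while [ "$i" -lt "$n" ]; do
        # Discard the command's own output so only the count is printed.
        "$@" > /dev/null 2>&1 &
        pids="$pids $!"
        i=$((i + 1))
    done
    failed=0
    for pid in $pids; do
        # wait PID returns the exit status of that background job.
        wait "$pid" || failed=$((failed + 1))
    done
    echo "$failed"
}

# Example (assumes ./getpostscript.awk returns non-zero on failure):
# run_parallel 40 ./getpostscript.awk
```

Comparing the printed failure count before and after a change gives a rough idea of whether the change helped.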
With xcatd patched not to exit on these errors, it could handle further requests and seemed to work fine that way, only more slowly. Clients retry in this case, so it worked OK.
Later on we realized that we needed a different memory overcommit setting, and we found that setting the sysctl
vm.overcommit_memory=0
instead of
vm.overcommit_memory=2
made the fork errors disappear.
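For anyone checking whether they are in the same situation: with vm.overcommit_memory=2 the kernel refuses allocations once Committed_AS would exceed CommitLimit (both shown in /proc/meminfo), and a fork can then fail even though ulimits and socket counts look fine. A rough way to see the remaining headroom is sketched below. commit_headroom_kb is a name made up for this example, not part of xCAT.

```shell
#!/bin/sh
# commit_headroom_kb [FILE]
# Hypothetical helper: print CommitLimit minus Committed_AS, in kB,
# read from FILE (defaults to /proc/meminfo).
commit_headroom_kb() {
    awk '/^CommitLimit:/  { limit = $2 }
         /^Committed_AS:/ { used = $2 }
         END              { print limit - used }' "${1:-/proc/meminfo}"
}

# Example usage on the node running xcatd:
# cat /proc/sys/vm/overcommit_memory    # current mode (0, 1 or 2)
# commit_headroom_kb                    # kB left before the limit
# sysctl -w vm.overcommit_memory=0      # switch to heuristic mode (as root)
```

A small or shrinking headroom in mode 2 while nodes are booting would be consistent with the fork failures we saw. In mode 0 the limit is not enforced, so the value can go negative there.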
Moved the open xCAT 2.6.6 bug to the current service stream.
Can you pick up the latest xCAT 2.6.10 build and see if this issue still exists?
If there is still a problem, please reopen the bug.