We have set up a hierarchical xCAT cluster and have noticed an issue where the SSL listener process dies. Specifically, it gets "cannot fork" errors and then exits. We can reproduce this by issuing a lot of concurrent requests, such as:
for i in $(seq 1 40); do ./getpostscript.awk > /dev/null & done
This pretty reliably causes the error. We see a similar storm of requests on our production system when we boot large numbers of nodes. Also, once the process has died, restarting the xcatd service fails again pretty quickly, because all of the nodes are now retrying their requests. The problem does not seem to occur on the management node, but the management node appears to handle the requests one at a time and is very slow. We have modified the xcatd script so that it does not immediately call 'die' when a fork fails, and with that change the process seems to continue handling other requests. We have checked various ulimits and socket counts and cannot find any resource that is exhausted.
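
For anyone trying to reproduce this, the kind of checks mentioned above can be collected with something like the sketch below. It is only an illustration: the function names and output labels are made up for this post and are not part of xCAT, and it assumes bash on a Linux service node. Fork failures are usually reported as either a process-count limit or a memory shortage, so the sketch prints the process-related limits together with the overcommit setting; it may well miss whatever is actually being hit here.

```shell
#!/bin/bash
# Sketch: print limits commonly involved in fork() failures.
# Function names and labels are hypothetical, not part of xCAT.

read_proc_value() {
    # Print the contents of a /proc file, or "unknown" if it is not readable.
    if [ -r "$1" ]; then
        cat "$1"
    else
        echo "unknown"
    fi
}

report_fork_limits() {
    # Per-user process limit and open-file limit of the current shell.
    echo "max_user_processes=$(ulimit -u 2>/dev/null || echo unknown)"
    echo "open_files=$(ulimit -n 2>/dev/null || echo unknown)"
    # System-wide caps on PIDs and threads.
    echo "pid_max=$(read_proc_value /proc/sys/kernel/pid_max)"
    echo "threads_max=$(read_proc_value /proc/sys/kernel/threads-max)"
    # Memory overcommit policy, which can affect fork() of a large process.
    echo "overcommit_memory=$(read_proc_value /proc/sys/vm/overcommit_memory)"
    # Number of processes currently running (header suppressed with "pid=").
    echo "current_processes=$(ps -e -o pid= 2>/dev/null | wc -l)"
}

report_fork_limits
```

Note that the limits that matter are those of the running xcatd process, which can differ from an interactive root shell, so it is worth running this from the same environment that starts the daemon, both while idle and while the request loop above is running.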