I am having real problems with what appear to be scheduling issues,
and I am running out of ideas...
We have a cluster of 5 machines, each running about 15 UMLs. Things
seem to run great for a while, then performance of the UMLs seems to
die for a while.
My solution to this was to look at all running processes every 30
seconds, calculate which UMLs were taking up more than their fair
share of CPU time, and renice them. I am thinking now:
1) Have I missed any processes if I look only every 30 seconds? From
my experience SKAS-mode seems to use the same PIDs over and over.
2) If I call setpriority on a UML instance (and its children), does this do
the right thing?
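
A minimal sketch of the 30-second sampling step, in Python, to make the
two questions above concrete. It is an illustration under assumptions,
not the monitoring code actually in use: the `owner` mapping from PID to
UML instance name is hypothetical, and samples are keyed by
(pid, starttime) so that a reused PID is not confused with the process
that previously held it. A process that starts and exits entirely
between two samples is still invisible to this approach.

```python
def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat.

    Returns (pid, ppid, starttime, cpu_ticks), where cpu_ticks is
    utime + stime in clock ticks.
    """
    # Field 2 (comm) may contain spaces or ')', so split on the last ')'.
    head, _, tail = text.rpartition(")")
    pid = int(head.split("(", 1)[0])
    fields = tail.split()          # fields[0] is field 3 (state)
    ppid = int(fields[1])          # field 4
    ticks = int(fields[11]) + int(fields[12])  # fields 14 and 15
    starttime = int(fields[19])    # field 22
    return pid, ppid, starttime, ticks


def instance_shares(prev, curr, owner):
    """Compute each UML instance's share of CPU between two samples.

    prev, curr: {(pid, starttime): cpu_ticks} from consecutive samples.
    owner:      {pid: instance_name} (hypothetical mapping).
    """
    used = {}
    for (pid, start), ticks in curr.items():
        name = owner.get(pid)
        if name is None:
            continue
        # A key missing from prev is a new process (or a reused PID),
        # so all of its ticks count towards this interval.
        delta = ticks - prev.get((pid, start), 0)
        used[name] = used.get(name, 0) + max(delta, 0)
    total = sum(used.values())
    return {n: (u / total if total else 0.0) for n, u in used.items()}
```

On the second question: as far as I know, setpriority with PRIO_PROCESS
changes only the one process it is given. Children inherit the nice
value at fork time, so processes that already exist have to be reniced
individually (or as a process group with PRIO_PGRP).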
I start my UMLs at setpriority 10, and drift up to 20 under heavy
load. I have code which reduces it back to 10 when the load gets
better. But my experience so far is that setpriority is not a strong
scheduling force in my system (2.6.7 hosts with bb1-complete patch
with 2.6.7 guests). 2.6.7 was the most reliable for me when I started
all this in the summer.
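
The drift described above could be sketched like this in Python. The
step size and the comparison against a fair share are assumptions made
for illustration. Note that Linux nice values stop at 19, so a request
for 20 is clamped to 19 by the kernel; the sketch uses 19 as its upper
bound for that reason.

```python
import os

NICE_BASE = 10    # starting priority for every UML
NICE_WORST = 19   # highest nice value Linux accepts


def next_nice(current, share, fair_share, step=1):
    """Drift one step up when over the fair share, else back towards base."""
    if share > fair_share:
        return min(NICE_WORST, current + step)
    return max(NICE_BASE, current - step)


def apply_nice(pids, value):
    """Renice every PID belonging to one UML instance.

    Lowering a nice value again normally requires root (CAP_SYS_NICE).
    """
    for pid in pids:
        try:
            os.setpriority(os.PRIO_PROCESS, pid, value)
        except ProcessLookupError:
            pass  # process exited between sampling and renicing
```

With 15 UMLs on a host, fair_share would be 1/15, so an instance using
half of the CPU would move one step closer to 19 at each 30-second pass.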
Don't get me wrong, performance in general (fewer than 9 machines per
host) is pretty impressive.
Is there a better scheduler I could be using in the host perhaps?
We are using this system for teaching server administration. Students
get buttons to reboot, shutdown, etc, and join a queue to get a
machine. We have interactive tutorials which telnet to their machines
to run tests, and it is this part in particular which is causing
trouble. It seems under heavy load the ability of UMLs to respond to
incoming packets is the thing which dies first. I am using tuntap for
networking, and openvpn as the bridging technology between the hosts.
Machines are all 1GB or better, 2GHz or better, on a 0.1GB network
backbone. No noticeable syslog errors. The only thing I didn't
expect is a few dropped packets on the tap devices themselves, but
these are 10 or fewer at worst.
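
One way to watch those tap drop counters over time is to read
/proc/net/dev on each host. The Python sketch below is only an
illustration: the function name is made up, and it assumes the usual
layout of that file (two header lines, then one line per interface
with 8 receive and 8 transmit columns).

```python
def parse_net_dev(text):
    """Return {interface: (rx_dropped, tx_dropped)} from /proc/net/dev text."""
    drops = {}
    for line in text.splitlines()[2:]:   # skip the two header lines
        if ":" not in line:
            continue
        name, _, rest = line.partition(":")
        fields = rest.split()
        # Receive columns come first: bytes packets errs drop ...
        # Transmit columns start at index 8, so transmit drop is index 11.
        drops[name.strip()] = (int(fields[3]), int(fields[11]))
    return drops


def tap_drops(path="/proc/net/dev"):
    """Drop counters for interfaces whose name starts with 'tap'."""
    with open(path) as f:
        stats = parse_net_dev(f.read())
    return {n: d for n, d in stats.items() if n.startswith("tap")}
```

Comparing the result of tap_drops() between two passes would show
whether the drops coincide with the periods when the UMLs stop
responding.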
All suggestions welcome.