From: Michal J. <mi...@ha...> - 2003-09-07 02:51:49
I have a test cluster (one master and one node) where both machines run 2.4.19-lanl.22smp from the "Clustermatic 3" CD, and booting is done via beoboot-cm.1.5. I am trying to get ganglia running in this setup. I got all the components compiled and installed, and after fiddling with the configuration a bit they even work; but there is a catch.

On nodes I need to run 'gmond', and the natural place to start it would be /etc/beowulf/node_up. By default (when /etc/gmond.conf is absent, for example) it runs as user "nobody". Any attempt to do that ends with "user 'nobody' does not exist". One can put a suitable /etc/gmond.conf on the nodes with the help of a line like

    plugin miscfiles /etc/beowulf/node/gmond.conf>/etc/gmond.conf

There we have two options. One is 'setuid root'. This brings "user 'root' does not exist", and gmond does not start from 'node_up', although running the same thing later by typing the commands works fine. Also, 'node_up' does not really return and the node status ends up as "error".

The other possibility is 'no_setuid on' in the node configuration file. This works to an extent. A 'node_up' script which looks like this:

    /usr/lib/beoboot/bin/node_up $* || exit 1
    echo "running bpsh $1 gmond"
    sleep 2
    bpsh $1 /usr/sbin/gmond
    bpsh $1 ps uwwaxf
    sleep 2
    bpsh $1 ps uwwaxf

prints the following in the log file for a node:

    .....
    nodeup : Node setup returned status 0
    running bpsh 0 gmond
    USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
    (null)    2014  0.0  0.0 14456 1004 ?        S    19:59   0:00 /usr/sbin/gmond
    (null)    2016  0.0  0.1 14564 1116 ?        S    19:59   0:00 /usr/sbin/gmond
    (null)    2019  0.0  0.1 14584 1136 ?        S    19:59   0:00  \_ /usr/sbin/gmond
    (null)    2020  0.0  0.1 14600 1160 ?        R    19:59   0:00  \_ /usr/sbin/gmond
    (null)    2021  0.0  0.1 14600 1180 ?        S    19:59   0:00  \_ /usr/sbin/gmond
    (null)    2022  0.0  0.1 14604 1184 ?        S    19:59   0:00  \_ /usr/sbin/gmond
    (null)    2023  0.0  0.1 14604 1188 ?        S    19:59   0:00  \_ /usr/sbin/gmond
    (null)    2024  0.0  0.1 14604 1188 ?        S    19:59   0:00  \_ /usr/sbin/gmond
    USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
    (null)    2014  0.0  0.1 15820 1384 ?        S    19:59   0:00 /usr/sbin/gmond

but if you check right after 'node_up' has returned, gmond is gone. So how am I supposed to start that daemon on a node? Anybody with ideas? Attempts through a subshell in a shell wrapper, on the off-chance that we are not detaching properly from a controlling terminal, fail with a bang.

I tried this with portmap as well. One difference is that ps prints "#1" instead of "(null)" in the USER field. That is good enough to do some NFS mounts with locking from 'node_up', but later portmap is also gone. It appears I am lucky that portmap does not seem to care which user id it runs under.

In this particular case I guess it is possible to work around the issue with a cron job which repeatedly starts gmond on every node that is 'up' in the 'bpstat' list. This is not particularly nice, I am afraid.

BTW, with beoboot-cm.1.4 I had big trouble running almost anything at all from 'node_up'. 1.5 is progress in that respect.

Michal
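
The cron workaround mentioned above could be sketched roughly as follows. This is only an illustration under assumptions, not a tested recipe for Clustermatic: node numbers are passed as arguments instead of being parsed from 'bpstat' output (its format is not shown in this post), and the function names and the 'ps ax' plus grep check are hypothetical. Only 'bpsh <node> <command>' and the /usr/sbin/gmond path are taken from the post.

```shell
#!/bin/sh
# Hypothetical cron helper: start gmond on nodes where it is not running.
# Assumption: node numbers are given as arguments, e.g. "0 1 2".

GMOND=/usr/sbin/gmond

# Succeeds if a gmond process shows up in the node's ps listing.
# The [g] in the pattern keeps grep from matching its own command line.
gmond_running() {
    bpsh "$1" ps ax 2>/dev/null | grep -q '[g]mond'
}

# Start gmond on one node, but only if it is not already there.
ensure_gmond() {
    if gmond_running "$1"; then
        echo "node $1: gmond already running"
    else
        echo "node $1: starting gmond"
        bpsh "$1" "$GMOND"
    fi
}

# Handle every node number passed on the command line.
for node in "$@"; do
    ensure_gmond "$node"
done
```

Saved under some name of your choosing and run from cron every few minutes with the list of nodes as arguments, this would restart gmond whenever it has vanished. It still needs 'no_setuid on' in the node's gmond.conf, and it does not explain why the daemon disappears after 'node_up' returns.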