We are too dependent on a functional DNS.
The Wiki RM module keeps lists of host communications
by hostname, but probably should use IP addresses for all
internal needs.
The Node Daemon uses the hostname:
1) To determine the task ID of the process to launch.
2) For starting MPI jobs, only launching the process
if the current node is the "head node" of the job.
(This relies on the executed "mpirun" to correctly
distribute the job itself).
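A minimal sketch of the two hostname-based decisions above, assuming the daemon is handed an ordered list of the job's hosts. The function names and the list format are hypothetical, not the Node Daemon's actual code:

```python
def task_id_for(local_name, job_hosts):
    """Return the position of local_name in the job's ordered host list.

    Returns None when the name is not found, which is exactly what
    happens if the daemon and the scheduler disagree on the node's name.
    """
    try:
        return job_hosts.index(local_name)
    except ValueError:
        return None


def should_launch_mpi(local_name, job_hosts):
    """Only the first host in the list (the "head node") runs mpirun."""
    return bool(job_hosts) and job_hosts[0] == local_name
```

Note that a plain string comparison means "n2" and "n2.example.com" are treated as different nodes, which is where the DNS dependence bites.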
For TCP (sched --> node) communication, the scheduler
(Wiki RM module) could handle #1 and #2 above by simply
telling the appropriate nodes what their task ID is and
which start or alloc command to run.
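Roughly, the scheduler could build one message per node, so that no node has to work out its own identity. This is only a sketch: the message fields ("taskid", "command", "is_head") and the function name are assumptions for illustration, not an existing wire format:

```python
def build_node_messages(job_hosts, command):
    """Build one per-node message keyed by the node's address.

    job_hosts is the ordered host list for the job; the first entry
    is treated as the head node. command is e.g. "start" or "alloc".
    """
    messages = {}
    for task_id, host in enumerate(job_hosts):
        messages[host] = {
            "taskid": task_id,         # scheduler assigns the ID (#1)
            "command": command,
            "is_head": task_id == 0,   # scheduler picks the head node (#2)
        }
    return messages
```

Each message would then be sent over that node's own TCP connection, so the daemon never has to compare hostnames.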
For multicast (sched --> nodes) communication,
handling #1 is complicated since each node has a
different ID, but each node gets the exact same
multicast message! In fact, I'm not sure there is a
better way to handle it than by hostname or IP address,
as is already done. #2 could be handled by a hybrid
approach: TCP to the head node and multicast to the rest of
them.
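To illustrate why the multicast case still needs some form of node identity: since every node receives identical bytes, the message has to carry the whole address-to-ID table, and each node picks out its own entry. A sketch, assuming a JSON payload and IP addresses as keys (both are assumptions, as are the function names):

```python
import json


def build_multicast_payload(job_addrs, command):
    """Encode one payload for all nodes: the command plus an addr -> task ID table."""
    tasks = {addr: task_id for task_id, addr in enumerate(job_addrs)}
    return json.dumps({"command": command, "tasks": tasks}).encode("utf-8")


def my_task_id(payload, local_addrs):
    """Find this node's task ID by matching any of its local addresses.

    local_addrs is every address the node knows for itself; returns
    None if none of them appear in the table.
    """
    tasks = json.loads(payload.decode("utf-8"))["tasks"]
    for addr in local_addrs:
        if addr in tasks:
            return tasks[addr]
    return None
```

Matching against all local addresses rather than a single name is one way to soften the multi-homed case, but it does not remove the need for the lookup.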
...
If the scheduler internals used IP addresses instead of
hostnames, the only real gotcha would be how to handle
multi-homed hosts. But I think we would still want to
use DNS or /etc/hosts for "pretty printing" in user output.
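Something like the following would keep name lookups out of the internals and use them only for display. It is a sketch using Python's standard socket.gethostbyaddr; the pretty_name helper and its resolver parameter are hypothetical:

```python
import socket


def pretty_name(ip, resolver=socket.gethostbyaddr):
    """Return a display name for ip, falling back to the IP itself.

    resolver defaults to the system reverse lookup (DNS or /etc/hosts).
    A failed lookup never affects scheduling because the result is
    only ever used for output.
    """
    try:
        return resolver(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses
        return ip
```

With this split, a broken DNS would make the output uglier but would not stop jobs from starting.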
Logged In: YES
user_id=362364
OR ...
We may want to "fix" this problem by being even more
"promiscuous" with names. So far the real-world problem is
that someone is running a node daemon and scheduler on the
same node, with each referring to itself by a
different name. Dunno why. If there were a way to
determine all names for a node then maybe it would work.
Except that if DNS has multiple hosts resolve to the same
name, it would potentially wreak job-starting havoc!
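A rough sketch of what "determine all names for a node" could look like, plus a check for the collision case. It uses the standard socket.gethostbyname_ex and socket.gethostbyaddr calls; the all_names and ambiguous_names helpers are hypothetical, and the result is only as complete as whatever DNS or /etc/hosts reports:

```python
import socket


def all_names(host, forward=socket.gethostbyname_ex, reverse=socket.gethostbyaddr):
    """Collect every name and address the resolver reports for host."""
    names = {host}
    try:
        canonical, aliases, addrs = forward(host)
    except OSError:
        return names  # unresolvable: all we know is the name we were given
    names.add(canonical)
    names.update(aliases)
    for addr in addrs:
        names.add(addr)
        try:
            rname, raliases, _ = reverse(addr)
        except OSError:
            continue
        names.add(rname)
        names.update(raliases)
    return names


def ambiguous_names(names_by_node):
    """Return names claimed by more than one node (the havoc case)."""
    owners = {}
    for node, names in names_by_node.items():
        for name in names:
            owners.setdefault(name, set()).add(node)
    return {name for name, nodes in owners.items() if len(nodes) > 1}
```

Any name returned by ambiguous_names would have to be rejected as an identifier, otherwise two nodes could both believe they own the same task.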