- priority: 5 --> 2
The following would be nice from a system administrator's point of view. Most, if not all, of the data required for this is already stored inside the notification system / job server databases.
Two graphical sections:
* Graph display of responders at the top, with connections "lighting up" between them when messages are fired.
* A "grid" display at the bottom, with each cell representing a compute node, possibly organised by cluster name. Each cell would be coloured, depending on what the compute node is doing, for instance:
- connected / idle
- disconnected (i.e., known to the job server, but has not 'phoned home' recently)
- connected / processing
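The three cell states above could be derived from data the job server presumably already keeps. The following is only a sketch: the `last_seen` and `current_job` inputs, the five-minute timeout, and the colour names are all assumptions made for illustration, not part of the existing system.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class NodeState(Enum):
    IDLE = "connected / idle"
    PROCESSING = "connected / processing"
    DISCONNECTED = "disconnected"


# Hypothetical colour scheme for the grid cells.
STATE_COLOURS = {
    NodeState.IDLE: "green",
    NodeState.PROCESSING: "blue",
    NodeState.DISCONNECTED: "grey",
}


def classify_node(
    last_seen: datetime,
    current_job: Optional[str],
    now: datetime,
    timeout: timedelta = timedelta(minutes=5),  # assumed "phoned home" window
) -> NodeState:
    """Map a node's last contact time and current job to a grid cell state."""
    # A node that has not contacted the job server recently is treated as
    # disconnected, even if a job is still recorded against it.
    if now - last_seen > timeout:
        return NodeState.DISCONNECTED
    return NodeState.PROCESSING if current_job is not None else NodeState.IDLE
```

The grid renderer would then only need `STATE_COLOURS[classify_node(...)]` per cell, grouped by cluster name if desired.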
As a bonus feature, maybe some information about how fast a compute node is compared to other nodes (e.g., a big star next to super-fast nodes).
This could also be used to discover misconfigured machines, i.e. machines that have a higher-than-average job failure rate. Such machines could be automatically blacklisted until the problem is solved.
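One possible way to pick out machines with an unusually high failure rate is sketched below. The input shape (failed and total job counts per node), the `factor` multiplier, and the `min_jobs` cut-off are assumptions for illustration; the real thresholds would need tuning against actual job server data.

```python
from typing import Dict, List, Tuple


def find_suspect_nodes(
    job_counts: Dict[str, Tuple[int, int]],
    factor: float = 2.0,   # assumed: flag nodes at more than 2x the mean rate
    min_jobs: int = 20,    # assumed: ignore nodes with too little history
) -> List[str]:
    """Return names of nodes whose failure rate is well above the average.

    job_counts maps node name -> (failed_jobs, total_jobs).
    """
    rates = {
        node: failed / total
        for node, (failed, total) in job_counts.items()
        if total >= min_jobs
    }
    if not rates:
        return []
    mean_rate = sum(rates.values()) / len(rates)
    # Candidates for automatic blacklisting until an administrator intervenes.
    return sorted(node for node, rate in rates.items() if rate > factor * mean_rate)
```

Requiring a minimum number of jobs avoids blacklisting a node that has simply failed its first one or two jobs.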
From a network administrator's point of view, the system might be able to discover "broken" machines: e.g., if a machine was working perfectly and then suddenly stops requesting jobs for a period of time, it could mean a hardware failure has occurred.
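A rough sketch of how such a "suddenly silent" machine might be detected: compare the time since the last job request with the machine's own typical gap between requests. The function name, the `factor` of 10, and the 30-minute minimum silence are hypothetical choices, and this assumes the job server records a timestamp for each job request.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import List


def looks_broken(
    request_times: List[datetime],
    now: datetime,
    factor: float = 10.0,                             # assumed multiplier
    min_silence: timedelta = timedelta(minutes=30),   # assumed lower bound
) -> bool:
    """Guess whether a previously active node has stopped requesting jobs."""
    if len(request_times) < 3:
        return False  # not enough history to know what "normal" looks like
    times = sorted(request_times)
    # Gaps between consecutive job requests, in seconds.
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    typical_gap = median(gaps)
    silence = (now - times[-1]).total_seconds()
    # Flag only when the silence is long both in absolute terms and
    # relative to how often this node normally asks for work.
    return silence > max(factor * typical_gap, min_silence.total_seconds())
```

A flagged node is only a hint that something may be wrong, so the display could mark it for attention and leave the diagnosis to an administrator.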