SLIBI is a logging and monitoring tool for InfiniBand networks using the OpenFabrics Enterprise Distribution (OFED) stack, aimed primarily at High Performance Computing (HPC) environments.
Its purpose is to allow continual monitoring of error rates on ports, together with the collection of performance data such as throughput. This allows the following to be automatically identified during normal cluster operation:
Hung switches (because typical IB networks have many redundant paths, a hung switch can otherwise go unnoticed on a cluster).
Faulty links: cards, cables, switch ports, etc.
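As a rough illustration of the idea, and not SLIBI's actual implementation, the sketch below turns two samples of a port's counters into error rates and throughput. The counter names follow the standard InfiniBand PortCounters attribute; the `Sample` type, the function names and the threshold value are hypothetical, and how the samples are collected is left out entirely.

```python
from dataclasses import dataclass

# Error counters to watch (names from the InfiniBand PortCounters attribute).
ERROR_COUNTERS = ("SymbolErrorCounter", "LinkDownedCounter", "PortRcvErrors")

# PortXmitData and PortRcvData count in units of 4 octets.
DATA_UNIT_BYTES = 4


@dataclass
class Sample:
    """One reading of a port's counters (hypothetical container type)."""
    timestamp: float  # seconds
    counters: dict    # counter name -> integer value


def port_rates(old: Sample, new: Sample) -> dict:
    """Return per-second error rates and throughput between two samples."""
    dt = new.timestamp - old.timestamp
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    rates = {}
    for name in ERROR_COUNTERS:
        delta = new.counters.get(name, 0) - old.counters.get(name, 0)
        if delta < 0:
            # Counter was reset between samples: count from zero.
            delta = new.counters.get(name, 0)
        rates[name] = delta / dt
    for name in ("PortXmitData", "PortRcvData"):
        delta = new.counters.get(name, 0) - old.counters.get(name, 0)
        if delta < 0:
            delta = new.counters.get(name, 0)
        rates[name + "BytesPerSec"] = delta * DATA_UNIT_BYTES / dt
    return rates


def is_suspect(rates: dict, max_errors_per_sec: float = 1.0) -> bool:
    """Flag a port whose error rate exceeds an (assumed) threshold."""
    return any(rates[name] > max_errors_per_sec for name in ERROR_COUNTERS)


old = Sample(0.0, {"SymbolErrorCounter": 10, "PortXmitData": 0})
new = Sample(10.0, {"SymbolErrorCounter": 30, "PortXmitData": 2500})
rates = port_rates(old, new)
print(rates["SymbolErrorCounter"], rates["PortXmitDataBytesPerSec"], is_suspect(rates))
```

Working from differences between samples rather than absolute values means a port that accumulated errors long ago is not flagged forever; only ports that are still logging errors stand out.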
It also allows certain metadata about a fabric to be held:
Keep track of any ports deliberately disabled because they are logging errors or are unused. The InfiniBand switches we have seen have a horrible habit of re-enabling disabled ports after a switch power cycle.
Associate names with InfiniBand entities, allowing easy identification of what each port is connected to.
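To make the two kinds of metadata concrete, here is a minimal sketch of how they could be held and used. It is an assumption for illustration only: the data layout, the function names, the GUID and the node name are all made up, and SLIBI's real storage is not described here. Ports are keyed by (node GUID, port number), and the state strings are InfiniBand physical port states.

```python
# Hypothetical metadata: names for node GUIDs, and the ports an
# administrator has deliberately disabled.
NAMES = {"0x0002c90200000001": "leaf-switch-01"}
INTENDED_DISABLED = {
    ("0x0002c90200000001", 3),
    ("0x0002c90200000001", 7),
}


def describe(port, names):
    """Return a readable label for a (guid, port number) pair."""
    guid, number = port
    return f"{names.get(guid, guid)} port {number}"


def reenabled_ports(intended_disabled, observed_state):
    """Return ports that should be disabled but are not reported as such.

    observed_state maps (guid, port number) to a physical state string;
    ports missing from it are skipped, as nothing is known about them.
    """
    return sorted(
        port
        for port in intended_disabled
        if port in observed_state and observed_state[port] != "Disabled"
    )


# Example: after a power cycle, port 3 has come back up on its own.
observed = {
    ("0x0002c90200000001", 3): "Polling",
    ("0x0002c90200000001", 7): "Disabled",
}
for port in reenabled_ports(INTENDED_DISABLED, observed):
    print("re-enabled:", describe(port, NAMES))
```

Holding the intended state separately from the observed state is what makes the power-cycle problem detectable: the switch forgets, the record does not.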
It is hoped that later analysis of throughput data may help in capacity planning, or at least produce pretty pictures.
The software has been in production use on a cluster running Red Hat Enterprise Linux 5 with Mellanox QDR host cards and switches. The InfiniBand fabric has 449 hosts and 2464 switch ports, and runs the minhop routing algorithm.
For discussion or queries about the project, please join SLIBI Discuss.
To receive project announcements, e.g. major release notifications, please join SLIBI Announce.
GNU General Public License, version 3. Copyright (C) 2010-2011 University of Leeds.