SLIBI Code
Status: Beta
Brought to you by: ccaamad
Simple Logging InfiniBand Infrastructure (SLIBI)
================================================

Copyright (C) 2010-2011 University of Leeds

This is the early release of a logging and monitoring tool for InfiniBand
networks using the OpenFabrics Enterprise Distribution (OFED) stack,
principally directed at High Performance Computing (HPC) environments.

Its purpose is to allow continual monitoring of error rates on ports,
together with the collection of performance data such as throughput. This
allows the following to be done during normal cluster operation:

* Identification of a hung switch (due to typical IB networks' high
  degeneracy, this can go unnoticed on a cluster).

* Identification of faulty components: cards, cables, switch ports, etc.

* Keeping track of any ports deliberately disabled because they are logging
  errors, or are unused. InfiniBand has a horrible habit of enabling
  disabled ports after a switch power cycle.

* Associating names with InfiniBand entities, allowing easy identification
  of what ports are connected to.

* It is hoped that later analysis of throughput data may help in capacity
  planning. Or at least pretty pictures.

This version is in production use on a cluster running Red Hat Enterprise
Linux 5, Mellanox QDR host cards and switches. The InfiniBand fabric has
449 hosts, 2464 switch ports and is running the minhop routing algorithm.

This software is under active development. In particular, this release has
an early version of the "slibi" command line tool, which will eventually
supersede the separate ibcheck, ibcollect and ibreport programs.


License
~~~~~~~

This software is released under the GNU General Public License, version 3.
Please see the LICENSE file for details.


Prerequisites
~~~~~~~~~~~~~

* MySQL database

* Perl5

* Common Perl5 modules, e.g. DBI, Getopt::Long, Data::Dumper, IO::File

* Perl5 modules for "slibi" command line interface: Term::ReadLine,
  Class::Struct, Safe

* cron

Internally, we make use of a simple Perl module called "SA".
A stub version has been provided with this release, providing minimal
functionality to allow the software to work.

WARNING: when run, this software will reset the error/performance counters
on your InfiniBand ports. Unless you are using other software which makes
use of this information (e.g. collectl, the sar tool replacement), this
shouldn't concern you.


Installation
~~~~~~~~~~~~

* Prepare a database

  Create a new database within MySQL. Also create two accounts: one with
  read/write access to it, the other with read access only.

    CREATE DATABASE infiniband;
    GRANT ALL ON infiniband.* TO infiniband IDENTIFIED BY 'SOME_PASSWORD';
    GRANT SELECT ON infiniband.* TO infiniband_read IDENTIFIED BY 'SOME_PASSWORD';

  (replace strings SOME_PASSWORD appropriately - you may also want to
  review what hosts you allow MySQL logins from - the above will allow
  remote logins by default)

  Edit ibcheck, ibcollect and ibreport, replacing the strings
  SOME_DATABASE, SOME_HOST, SOME_USER and SOME_PASSWORD appropriately.

  Create the necessary tables using the schema.mysql file:

    mysql -h SOME_HOST -u SOME_USER -p SOME_DATABASE < schema.mysql

* Create cron jobs

  These need to run under the root account, to allow the various
  InfiniBand commands to work. Examples:

    # Collect InfiniBand fabric data
    22 * * * * root cd /data/infiniband/bin && perl ./ibcollect
    #
    # Report on today's InfiniBand fabric data
    44 23 * * * root cd /data/infiniband/bin && ./ibreport_today 2>&1 | mail -s "ib errors summary" SOME_EMAIL_ADDRESS
    #
    # Report on up/down links, missing hosts or switches
    44 0 * * * root cd /data/infiniband/bin && ./ibcheck 2>&1 | mail -s "ib node summary" SOME_EMAIL_ADDRESS

  (edit times and SOME_EMAIL_ADDRESS appropriately)

* Update switch information

  Hostnames are automatically populated each time ibcollect is run. This
  information ultimately comes from what the IB stack on the host reports
  it as.

  Switch names are not automatically populated, apart from when they are
  first discovered.
  You may wish to update the "name" field in table "nodes" to something
  more memorable.

  The "enclosures" table helps keep track of the different switch types and
  will be used to map logical ports to physical ports. This will aid the
  identification of physical cables for less usual IB switch types (e.g.
  those connectors containing 3 ports). You may wish to set the
  "enclosure_id" field in table "nodes" appropriately.


FAQ
~~~

Q: I've run ./ibreport_today and it doesn't print anything!

A1: Well done - it hasn't found any problems :)

A2: Take a look at the contents of table "timedata". You should have a
    tuple for each port multiplied by the number of times you've run
    ibcollect. You need to run ibcollect at least twice before slibi can
    calculate if ports have exceeded error rates.

Q: What error rates are really bad?

A: There's a lot of debate about this. SymbolErrors seem to be the most
   minor (packets that don't make sense), then RcvErrors (data errors),
   then LinkRecovers are the most serious (port flapping, resulting in a
   topology change and recalculation). The top of ibreport has some example
   definitions of "bad" error rates, but we're not experts.

Q: What are these error numbers that are suspiciously close to a power
   of 2?

A: InfiniBand counters don't wrap: each field has a fixed size and sticks
   at its maximum value. So if you see a value at (or one below) a power
   of 2, it has probably overflowed. This is why ibcollect resets all
   counters each time it is run.

Q: Why don't you use a Round Robin Database (RRD) to store this
   information, instead of an RDBMS like MySQL?

A1: We've got lots of disk space.

A2: RRD systems such as rrdtool/Ganglia/Cacti are great at storing
    aggregate information about time series while occupying a static
    amount of space. However, the nature of routing in InfiniBand means
    that you tend to be interested in specific error events, not looking
    at averages.

A3: Our cluster has almost 3000 switch and host ports. For each port, we're
    collecting 16 items of information, together with how the ports are
    connected to each other.
    SQL seemed the obvious choice to handle this.

Q: What OFED commands does it use?

A: ibnetdiscover (for topology information), perfquery (for access to
   counters), ibportstate (for access to port status).

Q: Why did you write this?

A1: We had InfiniBand problems, seriously affecting our Lustre filesystem.
    We also had real users we couldn't kick off. We needed something to
    keep track of error rates over time under normal use.

A2: We found the documentation of the normal OFED tools impenetrable. The
    diagnostic tools seem best suited for debugging a cluster running a
    known test workload, and not one in production. We wanted something to
    keep an eye on things with an unknown workload.


TODO
~~~~

* Document how to use it :)

* [IN PROGRESS] Remove the root access requirement: use sudo to access
  privileged operations.

* [IN PROGRESS] Create a separate configuration file, allowing the
  specification of database details in one place. Also use it to allow
  configuration of the tool - e.g. only collect error data, only collect
  performance data, etc.

* [IN PROGRESS] Unify the different commands under a single CLI.

* Allow modification of the database through the CLI, instead of
  seat-of-the-pants modification of the SQL database.

* Check how portable this software is; try to get it working on other
  clusters and other InfiniBand vendors' equipment.

* GraphViz visualisation of the topology (or subset).

* Extension to cope with identification of physical cables carrying
  multiple links (e.g. like our existing Sun 3-way cables). Already partly
  implemented, but currently non-functional.

* Keep track of node names at different points in time, instead of just
  keeping the last one seen.

* Analysis of traffic to look for bottlenecks.
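As an illustration of the FAQ entries on error rates and overflowed counters,
here is a minimal Python sketch (not taken from ibreport or ibcollect) of how a
counter value might be assessed. It relies on two points from the FAQ: counters
stick at their maximum rather than wrapping, and ibcollect resets them on every
run, so a value read at the next run counts the errors since the previous run.
The field widths, the thresholds and all function names below are assumptions
made up for this example.

```python
# Field widths (in bits) of a few counters. ASSUMED for illustration; check
# them against what perfquery reports on your own fabric.
COUNTER_BITS = {
    "SymbolErrors": 16,
    "RcvErrors": 16,
    "LinkRecovers": 8,
}

# HYPOTHETICAL thresholds in errors per hour. These are NOT the definitions
# found at the top of ibreport; they only make the example concrete.
BAD_PER_HOUR = {
    "SymbolErrors": 10.0,
    "RcvErrors": 1.0,
    "LinkRecovers": 0.1,
}


def is_saturated(name, value):
    """True if the counter has reached the largest value its field can hold."""
    max_value = 2 ** COUNTER_BITS[name] - 1
    return value >= max_value


def errors_per_hour(value, seconds_since_last_collect):
    """Convert a count gathered since the last reset into an hourly rate."""
    if seconds_since_last_collect <= 0:
        raise ValueError("need a positive interval between two collections")
    return value * 3600.0 / seconds_since_last_collect


def assess(name, value, seconds_since_last_collect):
    """Classify one counter reading as 'saturated', 'bad' or 'ok'."""
    if is_saturated(name, value):
        # The true count is unknown: the counter stopped at its maximum.
        return "saturated"
    rate = errors_per_hour(value, seconds_since_last_collect)
    return "bad" if rate > BAD_PER_HOUR[name] else "ok"


if __name__ == "__main__":
    # Readings taken one hour after the previous collection.
    print(assess("SymbolErrors", 5, 3600))
    print(assess("RcvErrors", 5, 3600))
    print(assess("LinkRecovers", 255, 3600))
```

A saturated reading is reported separately because its rate is only a lower
bound; collecting more often is the only way to get a real figure for it.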

Example database queries
~~~~~~~~~~~~~~~~~~~~~~~~

* Get the port_id for a GUID/port combination:

    SELECT ports.id as port_id, nodes.name, nodes.guid, ports.port, ports.status
    FROM nodes,ports
    WHERE nodes.id = ports.node_id
    AND nodes.guid = '0x5080020000b3b5dd';

* Get a description for a port_id:

    SELECT nodes.name,guid,port,ports.status,ports.id
    FROM nodes,ports
    WHERE nodes.id = ports.node_id
    AND ports.id = 1635;

* The above works for a host. If nodes.name starts with 'I4', it's a
  switch. Use this instead:

    SELECT enclosures.name,nodes.name,guid,port,ports.status,ports.id
    FROM enclosures,nodes,ports
    WHERE enclosures.id = nodes.enclosure_id
    AND nodes.id = ports.node_id
    AND ports.id = 1635;

* Get the GUIDs for a node:

    SELECT nodes.name, nodes.guid, ports.port, ports.id,ports.status
    FROM nodes,ports
    WHERE nodes.id = ports.node_id
    AND nodes.name like 'c1s3b3n%';

* Last port_id seen connected to a port_id:

    SELECT ports.id as port_id, nodes.name, nodes.guid, ports.port, ports.status
    FROM nodes,ports
    WHERE nodes.id = ports.node_id
    AND ports.id = (
      SELECT conn_port_id
      FROM timedata
      WHERE port_id = 1245
      AND conn_port_id IS NOT NULL
      ORDER BY TIMESERIES_ID DESC
      LIMIT 1
    );

* List ports with a status of faulty:

    SELECT enclosures.name, nodes.name, nodes.guid, ports.port, ports.status
    FROM enclosures,nodes,ports
    WHERE enclosures.id = nodes.enclosure_id
    AND nodes.id = ports.node_id
    AND ports.status = 'faulty'
    ORDER BY enclosures.name;
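
If you would rather run such a query from a script than from the mysql
prompt, the following Python sketch shows the "faulty ports" query driven with
a bound parameter. To stay self-contained it uses an in-memory SQLite database
instead of MySQL, and a cut-down schema containing only the columns that the
queries above mention; the real layout lives in schema.mysql. The sample rows,
the "ok" status value and the function names are invented for the example.

```python
import sqlite3

# HYPOTHETICAL stand-in for schema.mysql: only the columns referenced by the
# example queries. The real schema has more tables and fields.
SCHEMA = """
CREATE TABLE enclosures (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT, guid TEXT,
                    enclosure_id INTEGER);
CREATE TABLE ports (id INTEGER PRIMARY KEY, node_id INTEGER, port INTEGER,
                    status TEXT);
"""

# Same joins as the "List ports with a status of faulty" query, with the
# status passed in as a parameter rather than written into the SQL text.
PORTS_BY_STATUS = """
SELECT enclosures.name, nodes.name, nodes.guid, ports.port, ports.status
FROM enclosures, nodes, ports
WHERE enclosures.id = nodes.enclosure_id
AND nodes.id = ports.node_id
AND ports.status = ?
ORDER BY enclosures.name
"""


def ports_with_status(conn, status="faulty"):
    """Return (enclosure, node, guid, port, status) rows for one status."""
    return conn.execute(PORTS_BY_STATUS, (status,)).fetchall()


def make_demo_db():
    """Build an in-memory database holding a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO enclosures VALUES (?, ?)",
                     [(1, "rack1-switch"), (2, "rack2-switch")])
    conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?)",
                     [(10, "I4-a", "0x0000000000000001", 1),
                      (11, "I4-b", "0x0000000000000002", 2)])
    conn.executemany("INSERT INTO ports VALUES (?, ?, ?, ?)",
                     [(100, 10, 1, "ok"),
                      (101, 10, 2, "faulty"),
                      (102, 11, 1, "faulty")])
    return conn


if __name__ == "__main__":
    for row in ports_with_status(make_demo_db()):
        print(row)
```

Against the real database you would swap the SQLite connection for a MySQL
one using the read-only account; note that MySQL drivers typically use a
different parameter placeholder than the "?" shown here.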