Managing the Mellanox Infiniband Network
Mellanox IB Interface Configuration
Note: To configure IB interfaces with xCAT 2.8 and above, see Configuring_Secondary_Adapters.
xCAT provides two sample postscripts, configiba.1port and configiba.2ports, to configure the IB secondary adapter. Both scripts can run on AIX and Linux nodes.
There are two ways to configure IB interfaces, either during node installation or using the updatenode command to update the node after the node is installed. Most of the configuration steps for the two ways are the same.
Select the correct sample scripts
The two scripts are stored in /opt/xcat/share/xcat/ib/scripts. Each IB adapter has two ports. If there is only one port available per adapter, you should use configiba.1port. If two ports are available per adapter, use the script configiba.2ports.
One port available:
cp /opt/xcat/share/xcat/ib/scripts/configiba.1port /install/postscripts/configiba
Two ports available:
cp /opt/xcat/share/xcat/ib/scripts/configiba.2ports /install/postscripts/configiba
Note: A new postscript, /install/postscripts/configib, is shipped with xCAT 2.8. The configib postscript works with the new "nics" table and the confignic postscript, which were also introduced in xCAT 2.8. The configiba.1port and configiba.2ports scripts will still work but are in maintenance mode.
Modify the sample postscript
- Modify the netmask and gateway values:
In the sample postscript, the netmask is hardcoded to 255.255.255.0 and the gateway is hardcoded to "X.X.255.254". If these values are not appropriate for your environment, change them in the script.
If the IB interface name is not a simple combination of the short hostname and ibX, or the netmask and gateway do not meet your requirements, modify the sample script as in the examples below.
- Modify the hostname and IP address scheme:
The default scheme used by the postscript to determine the hostname and IP address for each IB interface is:
- form the hostname of the IB interface by concatenating the node name with the interface name
- resolve this hostname to get the IP address associated with it
- use this hostname and IP address to configure the IB interface
If this scheme doesn't work for you, modify it in the postscript. For example, suppose the node name of the compute node is xcat01-en (a hostname of *-eth* is not supported) and the IB interface names are xcat01-ib0, xcat01-ib1, etc. In that case, change the following lines in /install/postscripts/configiba:
if [ $NODE ]
then
    hostname="$NODE-$nic"
else
    hostname="$HOST-$nic"
fi
to:
fullname=`echo $NODE | cut -c 1-11`
hostname="$fullname-$nic"
Adjust the character range given to cut so that it selects only the part of the node name shared with the IB hostnames (for xcat01-en, that is characters 1-6).
For additional information about the hostname/IP address scheme, see the documentation for configuring additional ethernet adapters, which uses a similar scheme: Configuring Secondary Adapters.
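The renaming rule described above can be sketched as a small shell function. This is only an illustration under stated assumptions, not part of the shipped postscript: the function name ib_hostname is hypothetical, and it assumes that the Ethernet node name ends in "-en", as in the xcat01-en example.

```shell
#!/bin/sh
# Hypothetical helper (not part of configiba): derive the IB interface
# hostname from a node name. Assumes Ethernet node names end in "-en";
# names without that suffix are used unchanged.
ib_hostname() {
    node=$1
    nic=$2
    base=${node%-en}          # strip a trailing "-en", if present
    echo "$base-$nic"
}

ib_hostname xcat01-en ib0     # prints xcat01-ib0
ib_hostname xcat01 ib1        # prints xcat01-ib1
```

Stripping the suffix by pattern rather than by a fixed character count avoids depending on the length of the node name.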
- Modify the IB adapter number:
By default, the script assumes that every node has one IB adapter. If each node has two adapters, modify /install/postscripts/configiba as follows. (This applies to some older xCAT releases; check the two sample postscripts, configiba.1port and configiba.2ports.) Change:
for num in 0 1
to:
for num in 0 1 2 3
In the latest release, the script detects the number of adapters itself, so this step is not needed.
- Modify the active port number:
For AIX, the configiba.1port script assumes that port 1 of the IB adapter is Active and port 2 is Down. If port 2 is Active and port 1 is Down in your environment, change port=1 to port=2 manually before using the script. Change:
#Configure the IB interfaces. Customize the port num.
iba_num=$num
ib_adapter="iba$iba_num"
port=1
to:
#Configure the IB interfaces. Customize the port num.
iba_num=$num
ib_adapter="iba$iba_num"
port=2
(AIX Only) Temporary Fix for Adapter Node Name
To aid in monitoring and debugging the IB fabric, it is very useful for each endpoint to have the proper node name associated with it. In Linux clusters this happens automatically, as it should. In AIX clusters, the AIX device driver doesn't yet put the node name correctly into the IB NIC definition. The AIX development team is working on a fix for this, but in the meantime, the xCAT configiba postscript can be modified as shown below to accomplish it. Note that this works well for AIX diskless nodes, since the postscript runs every time the node boots. For AIX diskful nodes, the postscript runs only during the initial install of the node, so you will also need to add a similar script to an rc file for subsequent boots.
Replace this section in the configiba script:
elif [ $PLTFRM == "AIX" ]
then
    lsdev -C | grep icm | grep Available
    if [ $? -ne 0 ]
    then
        mkdev -c management -s infiniband -t icm
        if [ $? -ne 0 ]
        then
            mkdev -l icm
            if [ $? -ne 0 ]
            then
                exit $?
            fi
        fi
    fi
    #Configure the IB interfaces. Customize the port num.
    iba_num=$num
    ib_adapter="iba$iba_num"
    port=1
    mkiba -a $ip -i $nic -A $ib_adapter -p $port -P -1 -S up -m $netmask
fi
with:
elif [ $PLTFRM == "AIX" ]
then
    if [ $num -eq 0 ]
    then
        rmdev -dl ib0
        rmdev -dl iba0
        rmdev -dl ib1
        rmdev -dl iba1
        rmdev -dl ib2
        rmdev -dl iba2
        rmdev -dl ib3
        rmdev -dl iba3
        rmdev -dl icm
        mkdev -c management -s infiniband -t icm
        cfgmgr
    fi
    #Configure the IB interfaces. Customize the port num.
    iba_num=$num
    ib_adapter="iba$iba_num"
    port=1
    mkiba -a $ip -i $nic -A $ib_adapter -p $port -P -1 -S up -m $netmask -k on
fi
Modify the /etc/hosts file
The IP address entries for IB interfaces in /etc/hosts on the xCAT management node should use a hostname that is a combination of the node name (usually the short hostname) and the unique IB interface name.
The format should be as follows:
xcat01 is the node name, xcat01-ib0, xcat01-ib1, xcat01-ib2, etc. are the host names for the IB interfaces on xcat01.
For AIX, the ml0 interface must also be set up together with the IB interfaces. It follows the same naming convention as the IB interfaces.
The following is an example of /etc/hosts for AIX:
192.168.0.10 xcat01
192.168.1.10 xcat01-ib0
192.168.2.10 xcat01-ib1
192.168.3.10 xcat01-ib2
192.168.4.10 xcat01-ib3
192.168.5.10 xcat01-ml0
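For a handful of nodes, entries in this layout can be generated rather than typed. The sketch below is an illustration only: the function name print_ib_hosts is hypothetical, and it assumes the addressing used in the example above (192.168.0.N for the node, one subnet per IB interface, and the next subnet for ml0).

```shell
#!/bin/sh
# Hypothetical helper: print /etc/hosts lines for one AIX node.
# Assumed scheme (taken from the example, not enforced by xCAT):
#   192.168.0.N      node
#   192.168.(i+1).N  node-ib<i>
#   next subnet      node-ml0
print_ib_hosts() {
    node=$1      # node name, e.g. xcat01
    octet=$2     # last octet shared by all of the node's addresses
    nports=$3    # number of IB interfaces
    echo "192.168.0.$octet $node"
    i=0
    while [ "$i" -lt "$nports" ]; do
        echo "192.168.$((i + 1)).$octet $node-ib$i"
        i=$((i + 1))
    done
    echo "192.168.$((nports + 1)).$octet $node-ml0"
}

print_ib_hosts xcat01 10 4
```

The output can be reviewed and then appended to /etc/hosts on the management node.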
For large networks, you can more easily maintain your /etc/hosts file by using the xCAT makehosts command. If your node hostnames and IP addresses follow a regular pattern, use a few regular expressions in the hosts table and then easily generate /etc/hosts using makehosts. See the makehosts man page for options.
For example, add a line to the hosts table like the one below, where compute is a group of nodes that are defined in your system.
The regular expressions above have the format: |pattern-match-on-the-nodename|value-to-put-in-this-col| . In this example, there is a regular expression for the ip column and a regular expression for the otherinterfaces column:
- ip column: match node names that have the format xxx-## . Extract the number part of the name and add it to 192.168.0.9 to form the ip address.
- otherinterfaces column: match node names that have the format xxx-## . Extract the number part of the name and create an entry like xcat01-ib0:192.168.1.10 .
See the xcatdb man page for a more complete explanation of regular expressions in xCAT tables.
Now that you have the regular expressions set up, each time you add a new node to the group, run makehosts <newnode> and it will be added to your /etc/hosts file.
Define the IB Switch (Optional)
Add the address of the IB Switch to /etc/hosts
Update networks table with IB sub-network
chdef -t network -o en0 net=192.168.0.0 mask=255.255.255.0 mgtifname=en0 nameservers=192.168.0.13
chdef -t network -o ib0 net=192.168.1.0 mask=255.255.255.0 mgtifname=ib0
chdef -t network -o ib1 net=192.168.2.0 mask=255.255.255.0 mgtifname=ib1
chdef -t network -o ib2 net=192.168.3.0 mask=255.255.255.0 mgtifname=ib2
chdef -t network -o ib3 net=192.168.4.0 mask=255.255.255.0 mgtifname=ib3
chdef -t network -o ib4 net=192.168.5.0 mask=255.255.255.0 mgtifname=ib4
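When there are many IB subnets, the repetitive chdef lines can be produced by a loop. The following is a dry-run sketch: it only prints the commands, the function name print_ib_network_defs is hypothetical, and the 192.168.(i+1).0 numbering is an assumption copied from the example above.

```shell
#!/bin/sh
# Dry run: print (do not execute) one chdef command per IB subnet.
# Assumes subnet 192.168.(i+1).0/24 belongs to interface ib<i>.
print_ib_network_defs() {
    count=$1     # number of IB subnets to define
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "chdef -t network -o ib$i net=192.168.$((i + 1)).0 mask=255.255.255.0 mgtifname=ib$i"
        i=$((i + 1))
    done
}

print_ib_network_defs 5
```

After reviewing the printed commands, they can be pasted into a shell on the management node.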
Note: The gateway, dhcpserver, tftpserver, and nameservers attributes in the networks table are not required for IB networks, because xCAT management traffic still runs over Ethernet. However, nameservers must be set on the Ethernet network for the DNS server that provides name resolution for the IB interfaces.
Setup name server on management node
Put IB interface entries in /etc/hosts into DNS and restart the DNS:
For Linux Management Nodes:
makedns
service named restart
For AIX Management Nodes:
makedns
stopsrc -s named
startsrc -s named
Check the IB network
Check whether DNS resolution of the IB network has been set up successfully on the management node. If not, review the previous setup steps.
nslookup xcat01-ib0
nslookup xcat01-ib1
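To check several interfaces at once, the lookups can be wrapped in a loop. This is a minimal sketch: the function name check_ib_names and the RESOLVER variable are hypothetical, and the resolver defaults to nslookup as used above.

```shell
#!/bin/sh
# Hypothetical helper: report which IB hostnames of a node resolve.
# RESOLVER is the lookup command; it defaults to nslookup.
RESOLVER=${RESOLVER:-nslookup}

check_ib_names() {
    node=$1
    shift                      # remaining arguments are interface names
    failed=0
    for nic in "$@"; do
        if "$RESOLVER" "$node-$nic" >/dev/null 2>&1; then
            echo "$node-$nic: resolves"
        else
            echo "$node-$nic: NOT resolved"
            failed=1
        fi
    done
    return "$failed"           # non-zero if any name failed to resolve
}

# Example use on the management node:
#   check_ib_names xcat01 ib0 ib1
```

Note that some nslookup versions exit with status 0 even when a name is not found, so it is worth confirming the result by hand for at least one name.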
Prepare IB drivers/libraries
For AIX, the IB drivers/libraries have been installed in the system.
This step is only needed for RHEL and SLES.
For Mellanox IB QDR, the drivers/libraries are in the Mellanox OFED release ISO for RHEL or SLES. You therefore need the Mellanox OFED ISO, such as MLNX_OFED_LINUX-1.5.3-3.0.0-sles11sp1-x86_64.iso, MLNX_OFED_LINUX-1.5.3-3.0.0-rhel6.1-x86_64.iso, or MLNX_OFED_LINUX-1.5.3-2.0.0-rhel6.1-ppc64.iso. Mount the distribution media at the suggested target location on the xCAT MN:
mkdir -p /install/post/otherpkgs/<osver>/<arch>/ofed
mount -o loop MLNX_OFED_LINUX-<packver1>-<packver2>-<osver>-<arch>.iso /install/post/otherpkgs/<osver>/<arch>/ofed
Take sles11 sp1 for x86_64 as an example:
mkdir -p /install/post/otherpkgs/sles11.1/x86_64/ofed/
mount -o loop MLNX_OFED_LINUX-1.5.3-3.0.0-sles11sp1-x86_64.iso /install/post/otherpkgs/sles11.1/x86_64/ofed/
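The substitution of <osver> and <arch> into the target path can be sketched as follows. This is a dry run that only prints the two commands; the function name print_ofed_mount_cmds is hypothetical, and the ISO file name is supplied by the caller rather than derived.

```shell
#!/bin/sh
# Dry run: print the commands that mount a Mellanox OFED ISO at the
# suggested location. Nothing is created or mounted by this sketch.
print_ofed_mount_cmds() {
    iso=$1       # path to the OFED ISO
    osver=$2     # e.g. sles11.1
    arch=$3      # e.g. x86_64
    dir="/install/post/otherpkgs/$osver/$arch/ofed"
    echo "mkdir -p $dir"
    echo "mount -o loop $iso $dir"
}

print_ofed_mount_cmds MLNX_OFED_LINUX-1.5.3-3.0.0-sles11sp1-x86_64.iso sles11.1 x86_64
```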
Configure IB interfaces during Node installation
- Copy the xCAT mlnxofed_ib_install script file to postscripts directory:
cp /opt/xcat/share/xcat/ib/scripts/Mellanox/mlnxofed_ib_install /install/postscripts/mlnxofed_ib_install
- Configuration for diskfull installation
chdef xcat01 -p postbootscripts=mlnxofed_ib_install,configiba
- copy the pkglist to the custom directory:
cp /opt/xcat/share/xcat/install/<ostype>/compute.<osver>.<arch>.pkglist /install/custom/install/<ostype>/compute.<osver>.<arch>.pkglist
- Edit your /install/custom/install/<ostype>/compute.<osver>.<arch>.pkglist and add:
- Make sure the related osimage uses the customized pkglist:
lsdef -t osimage -o <osver>-<arch>-install-compute
If it does not, change it:
chdef -t osimage -o <osver>-<arch>-install-compute pkglist=/install/custom/install/<ostype>/compute.<osver>.<arch>.pkglist
- Configuration for diskless installation
chdef xcat01 -p postscripts=configiba
- copy the pkglist to the custom directory:
cp /opt/xcat/share/xcat/netboot/<ostype>/compute.<osver>.<arch>.pkglist /install/custom/netboot/<ostype>/compute.<osver>.<arch>.pkglist
- Edit your /install/custom/netboot/<ostype>/<profile>.pkglist and add:
- Take sles11 sp1 for x86_64 as an example:
- Edit the /install/custom/netboot/sles11.1/x86_64/compute/compute.sles11.1.x86_64.pkglist and add:
- Add to postinstall scripts:
- Edit your /install/custom/netboot/<ostype>/<profile>.postinstall and add:
installroot=$1 ofeddir=/install/post/otherpkgs/<osver>/<arch>/ofed/ NODESETSTATE=genimage /install/postscripts/mlnxofed_ib_install
- Take sles11 sp1 for x86_64 as an example:
- Edit the /install/custom/netboot/sles/compute.postinstall and add:
installroot=$1 ofeddir=/install/post/otherpkgs/sles11.1/x86_64/ofed/ NODESETSTATE=genimage /install/postscripts/mlnxofed_ib_install
- Make sure the related osimage uses the customized pkglist and the customized compute.postinstall:
lsdef -t osimage -o <osver>-<arch>-netboot-compute
If it does not, change it:
chdef -t osimage -o <osver>-<arch>-netboot-compute pkglist=/install/custom/netboot/<ostype>/compute.<osver>.<arch>.pkglist postinstall=/install/custom/netboot/<ostype>/<profile>.postinstall
Update the xCAT postscripts table
Add your modified ib setup script to the postscripts list for your node install.
chdef xcat01 -p postscripts=configiba
Update permission for modprobe.conf and sysctl.conf
The configiba script modifies /etc/infiniband/, /etc/modprobe.conf and /etc/sysctl.conf, but in statelite images these are not writable by default. You must make sure that /etc/infiniband/, /etc/modprobe.conf and /etc/sysctl.conf are writable in the statelite image. For more detailed information on statelite configuration, check statelite documentation:
Add perl packages
Add perl packages (only for xCAT versions below 2.6.6)
Perl is not installed by default on Linux nodes, so the configiba postscript, which is written in perl, would fail. The admin needs to add perl to the diskless image or add the perl install rpms for diskful nodes.
Also, for diskless boot on Linux, remove the following line in /opt/xcat/share/xcat/netboot/<os>/compute.exlist to add perl packages:
Add perl packages (only for xCAT 2.6.6 and later)
The configiba script has been rewritten as a shell script. However, some Mellanox OFED packages on sles depend on perl modules. So for diskless/statelite boot on sles, remove the following line in /opt/xcat/share/xcat/netboot/sles/compute.exlist to add perl packages:
Start to install the nodes or update the nodes for IB configuration
All the preparation work for IB configuration is now done. If the nodes have already been installed, use the updatenode command to update them:
updatenode xcat01 configiba
Otherwise, continue with the node installation process.
For diskless Linux nodes:
You must install the IB device driver packages into the diskless image before node installation. For more details, check section :
After doing this, run:
nodeset xcat01 osimage=<osver>-<arch>-netboot-compute
rnetboot xcat01
For diskful Linux nodes:
nodeset xcat01 osimage=<osver>-<arch>-install-compute
rnetboot xcat01
To install diskful AIX nodes:
nimnodeset -i <nimimage> xcat01
rnetboot xcat01
To install diskless boot AIX nodes:
mkdsklsnode -i <nimimage> xcat01
rnetboot xcat01
Check the result of IB configuration
This check assumes that the management node (MN) has IB adapters. Use a ping test from the management node to the IB interfaces on the compute nodes to see whether the IB adapters work.
On SLES, there is an issue with openibd: a compute node reboot or an openibd restart resets two settings in /etc/sysctl.conf that were modified by the configiba script. After every reboot or openibd restart, the admin has to update the settings manually. The following three commands do that efficiently from the management node:
xdsh xcat01 sed -i 's/net.ipv4.conf.ib0.arp_filter=0/net.ipv4.conf.ib0.arp_filter=1/g' /etc/sysctl.conf
xdsh xcat01 sed -i 's/net.ipv4.conf.ib0.arp_ignore=0/net.ipv4.conf.ib0.arp_ignore=1/g' /etc/sysctl.conf
xdsh xcat01 sysctl -p
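The two substitutions in the commands above can also be collected into one function that runs on the node itself. This is a sketch, not an xCAT-supplied script: the function name fix_ib_arp is hypothetical, it handles ib0 only (as the commands above do), and it assumes GNU sed for the -i option.

```shell
#!/bin/sh
# Hypothetical helper: re-apply the two ib0 ARP settings in a sysctl file.
# The file defaults to /etc/sysctl.conf; pass another path to try the
# function on a copy first. Assumes GNU sed (-i edits the file in place).
fix_ib_arp() {
    conf=${1:-/etc/sysctl.conf}
    sed -i \
        -e 's/net\.ipv4\.conf\.ib0\.arp_filter=0/net.ipv4.conf.ib0.arp_filter=1/g' \
        -e 's/net\.ipv4\.conf\.ib0\.arp_ignore=0/net.ipv4.conf.ib0.arp_ignore=1/g' \
        "$conf"
}

# Example use on a compute node (as root):
#   fix_ib_arp && sysctl -p
```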
Mellanox Switch Configuration
Setup the xCAT Database
The Mellanox Switch is only supported in xCAT Release 2.7 or later.
- Add the switch ip address in the /etc/hosts file
- Define IB switch as a node
chdef -t node -o mswitch groups=all nodetype=switch mgt=switch
- Add the login user name and password to the switches table:
tabch switch=mswitch switches.sshusername=admin switches.sshpassword=admin switches.switchtype=MellanoxIB
The switches table will look like this:
If there is only one admin ID and one password for all the switches, put an entry in the xCAT passwd table with the admin ID and password to use for login.
tabch key=mswitch passwd.username=admin passwd.password=admin
The passwd table will look like this:
Setup ssh connection to the Mellanox Switch
To run commands such as xdsh and scripts against the Mellanox Switch, ssh must be set up to log in to the switch without prompting for a password. To do this, first add a configuration file. This configuration file is NOT needed for xCAT 2.8 and later.
mkdir -p /var/opt/xcat/IBSwitch/Mellanox
cd /var/opt/xcat/IBSwitch/Mellanox
cp /opt/xcat/share/xcat/ib/scripts/Mellanox/config .
The file contains the following:
[main]
[xdsh]
pre-command=cli
post-command=NULL
Then run the following:
rspconfig mswitch sshcfg=enable
Setup syslog on the Switch
Use the following command to consolidate the syslog on the Management Node or Service Nodes, where ip is the address of the MN or SN as known by the switch.
rspconfig mswitch logdest=<ip>
Configure xdsh for Mellanox Switch
To run xdsh commands on the Mellanox Switch, you must use the --devicetype input flag to xdsh. In addition, for xCAT versions earlier than 2.8, you must add a configuration file; see the "Setup ssh connection to the Mellanox Switch" section.
For the Mellanox Switch the --devicetype is "IBSwitch::Mellanox". See xdsh man page: http://xcat.sourceforge.net/man1/xdsh.1.html for details.
Now you can run the switch commands from the mn using xdsh. For example:
xdsh mswitch -l admin --devicetype IBSwitch::Mellanox 'enable;configure terminal;show ssh server host-keys'
Commands Supported for the Mellanox Switch
Setup the snmp alert destination:
rspconfig <switch> snmpdest=<ip> [remove]
where "remove" means to remove this ip from the snmp destination list.
Enable/disable setting the snmp traps.
rspconfig <switch> alert=enable/disable
Define the read only community for snmp version 1 and 2.
rspconfig <switch> community=<string>
Enable/disable the snmp function on the switch.
rspconfig <switch> snmpcfg=enable/disable
Enable/disable ssh-ing to the switch without password.
rspconfig <switch> sshcfg=enable/disable
Set up the remote syslog receiver for this switch, and also define the minimum severity level of the logs that are sent. The valid levels are: emerg, alert, crit, err, warning, notice, info, debug, none, remove. "remove" means to remove the given ip from the receiver list.
rspconfig <switch> logdest=<ip> [<level>]
For doing other tasks on the switch, use xdsh. For example:
xdsh mswitch -l admin --devicetype IBSwitch::Mellanox 'show logging'
Interactive commands are not supported by xdsh. For interactive commands, use ssh.
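Because every xdsh call to the switch repeats the same login and device type flags, a small wrapper can reduce typing. The sketch below is an illustration only: the function name mlnx_xdsh and the XDSH override variable are hypothetical, and the admin login is taken from the examples above.

```shell
#!/bin/sh
# Hypothetical wrapper around xdsh for Mellanox switches.
# Set XDSH=echo to preview the command line instead of running it.
XDSH=${XDSH:-xdsh}

mlnx_xdsh() {
    switch=$1
    shift                      # remaining arguments form the switch command
    "$XDSH" "$switch" -l admin --devicetype IBSwitch::Mellanox "$*"
}

# Example use on the management node:
#   mlnx_xdsh mswitch 'show logging'
```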
Send SNMP traps to xCAT Management Node
First, download http://www.mellanox.com/related-docs/prod_ib_switch_systems/MELLANOX-MIB.zip and unzip it. Copy the mib file MELLANOX-MIB.txt to the /usr/share/snmp/mibs directory on the mn, and on the sn if the sn is the snmp trap destination.
To configure, run:
monadd snmpmon
moncfg snmpmon <mswitch>
To start monitoring, run:
monstart snmpmon <mswitch>
To stop monitoring, run:
monstop snmpmon <mswitch>
To deconfigure, run:
mondecfg snmpmon <mswitch>
UFM servers are just regular Linux boxes with UFM installed. xCAT can help install and configure the UFM servers. The xCAT mn can send remote commands to UFM through xdsh. It can also collect SNMP traps and syslogs from the UFM servers.
Setup xdsh to UFM and backup
Assume we have two hosts with UFM installed, called host1 and host2. First define the two hosts in the xCAT cluster. The UFM hosts are usually on a different network than the compute nodes, so make sure to assign the correct servicenode and xcatmaster in the noderes table. Also make sure to assign the correct os and arch values in the nodetype table for the UFM hosts. For example:
mkdef -t node -o host1,host2 groups=ufm,all os=sles11.1 arch=x86_64 servicenode=10.0.0.1 xcatmaster=10.0.0.1
Then exchange the SSH key so that it can run xdsh.
xdsh host1,host2 -K
Now we can run xdsh on the UFM hosts.
xdsh ufm date
Run the following command to make the UFM hosts send their syslogs to the xCAT mn:
updatenode ufm -P syslog
To test, run the following command on the UFM hosts and check whether the xCAT MN receives the new messages in /var/log/messages:
logger xCAT "This is a test"
Send SNMP traps to xCAT Management Node
You need to have the Advanced License for UFM in order to send SNMP traps.
1. Copy the mib file to /usr/share/snmp/mibs directory on the mn.
scp ufmhost:/opt/ufm/files/conf/vol_ufm3_0.mib /usr/share/snmp/mibs
where ufmhost is the host where UFM is installed.
2. On the UFM host, open the /opt/ufm/conf/gv.cfg configuration file. Under the [Notifications] line, set
snmp_listeners = <IP Address 1>[:<port 1>][,<IP Address 2>[:<port 2>]…]
The default port is 162. For example:
ssh ufmhost
vi /opt/ufm/conf/gv.cfg
....
[Notifications]
snmp_listeners = 10.0.0.1
where 10.0.0.1 is the ip address of the management node.
3. On the UFM host, restart the ufmd.
service ufmd restart
4. From the UFM GUI, click on the "Config" tab and bring up the "Event Management" Policy Table. Then select the SNMP check boxes for the events you are interested in, so that the system sends SNMP traps for these events. Click "OK".
5. Make sure snmptrapd is up and running on mn and all monitoring servers.
- It should have the '-m ALL' flag.
ps -ef |grep snmptrapd
root 31866 1 0 08:44 ? 00:00:00 /usr/sbin/snmptrapd -m ALL
- If it is not running, then run the following commands:
monadd snmpmon
monstart snmpmon
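Steps 2 and 5 above can be sketched as two shell functions. Both are illustrations under stated assumptions rather than UFM or xCAT tools: the function names are hypothetical, the first assumes that gv.cfg already contains a snmp_listeners line under [Notifications], and the second reads ps -ef output on standard input.

```shell
#!/bin/sh
# Hypothetical helper: set snmp_listeners inside the [Notifications]
# section of a UFM gv.cfg file. An existing snmp_listeners line in that
# section is replaced; the key is not added if it is missing.
set_snmp_listeners() {
    cfg=$1
    listeners=$2
    tmp=$(mktemp) || return 1
    if awk -v val="$listeners" '
        /^\[/ { in_sec = ($0 == "[Notifications]") }
        in_sec && /^[ \t]*snmp_listeners[ \t]*=/ {
            print "snmp_listeners = " val
            next
        }
        { print }
    ' "$cfg" > "$tmp"; then
        cat "$tmp" > "$cfg"    # overwrite in place, keeping file permissions
        rc=$?
    else
        rc=1
    fi
    rm -f "$tmp"
    return "$rc"
}

# Hypothetical helper: succeed if the ps output on stdin shows an
# snmptrapd process started with "-m ALL".
snmptrapd_has_all_mibs() {
    grep 'snmptrapd' | grep -v grep | grep -q -- '-m ALL'
}

# Example use:
#   set_snmp_listeners /opt/ufm/conf/gv.cfg 10.0.0.1     (on the UFM host)
#   ps -ef | snmptrapd_has_all_mibs && echo "snmptrapd OK"   (on the mn)
```

Editing only within the named section leaves any snmp_listeners key in another section untouched.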
Mellanox Switch and Adapter Firmware Update
Adapter Firmware Update
The adapter firmware update process differs depending on whether the node runs AIX or Linux. The general steps are the same, but the commands that perform the upgrade differ because the firmware image is packaged differently. Download the OFED IB adapter firmware from the Mellanox site http://www.mellanox.com/page/firmware_table_IBM .
AIX OS image
i) Obtain device id
lscfg -vp -l iba*
ii) Check current installed fw level
lscfg -vp -l iba0 |grep ROM
iii) Copy firmware to /etc/microcode
iv) Burn new firmware on each ibaX
diag -cd iba0 -T "download -f"
v) Verify download successful
diag -d iba0 -T disp_mcode
vi) Activate the new firmware
reboot the image
Note: The iba0 device id above is used as an example only. It is not meant to imply that there is only one device id.
Linux OS image
i) Obtain device id
lspci | grep -i mel
ii) Check current installed fw level
mstflint -d 0002:01:00.0 q | grep FW
iii) Copy or mount firmware to host
iv) Burn new firmware on each ibaX
mstflint -d 0002:01:00.0 -i <image location> b
Note: If this is a PureFlex MezzanineP adapter, you must select the correct image for each ibaX device. Note the difference at the end of the firmware image filename: *_0.bin (iba0/iba2) and *_1.bin (iba1/iba3)
v) Verify download successful
mstflint -d 0002:01:00.0 q
vi) Activate the new firmware
reboot the image
Note: The 0002:01:00.0 device location above is used as an example only. It is not meant to imply that there is only one device location or that your device will have the same device location.
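Because a node may have several device locations, steps i) and ii) can be combined into a loop over every Mellanox device that lspci reports. The sketch below is a hedged illustration: the function name mlnx_pci_ids is hypothetical, and it simply takes the first column of each matching lspci line as the device location.

```shell
#!/bin/sh
# Hypothetical helper: read lspci output on stdin and print the PCI
# address (first column) of every line that matches "mel", using the
# same match as the grep in step i).
mlnx_pci_ids() {
    grep -i 'mel' | awk '{ print $1 }'
}

# Example use on a Linux node (requires lspci and mstflint; -D makes
# lspci always print the PCI domain):
#   lspci -D | mlnx_pci_ids | while read -r dev; do
#       echo "== $dev"
#       mstflint -d "$dev" q | grep FW
#   done
```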
Mellanox Switch Firmware Upgrade
This section provides a manual procedure for updating the firmware of Mellanox Infiniband (IB) switches. You can download IB switch firmware, such as that for the IB6131 (image-PPC_M460EX-SX_3.2.xxx.img), from the Mellanox website http://www.mellanox.com/page/firmware_table_IBM and place it on your xCAT Management Node or on a server that can communicate with the Flex IB6131 switch module. There are two ways to update the MLNX-OS switch package. The process is the same whether you are updating an internal PureFlex chassis Infiniband switch (IB6131) or an external Mellanox switch.
Update via Browser
This method is straightforward if your switches are on the public network or your browser can already tunnel to the private address. If neither is the case, you may prefer the second method, the firmware update using the CLI.
i) After logging into the switch (id=admin, pwd=admin)
- Select the "System" tab and then the "MLNX-OS Upgrade" option
- Under the "Install New Image", select the "Install via scp"
- URL: scp://userid@fwhost/directoryofimage/imagename
- Select "Install Image"
- The image will then be downloaded to the switch and the installation process will begin.
ii) Once completed, the switch must be rebooted for the new package to be activated.
Firmware Update using CLI
1) Login to the IB switch
ssh admin@<switchipaddr>
enable (get into correct CLI mode. You can use en)
configure terminal (get into correct CLI mode. You can use co t)
2) List current images and remove older images to free up space
show image
image delete <ibimage> (you can paste in ibimage name from show image for image delete)
3) Fetch the new IB image with scp from a server that holds it. An example of an IB6131 image is "image-PPC_M460EX-SX_3.2.0291.img". The admin can use a different protocol. The image fetch over scp takes about 4 minutes.
image fetch ?
image fetch scp://userid:password@serveripaddr/<full path ibimage location>
4) Verify that the new IB image is loaded, then install the new IB image on the IB switch. The install process goes through 4 stages: Verify Image, Uncompress Image, Create Filesystems, and Extract Image. It takes about 9 minutes.
show image
image install <newibimage> (you can paste in new IB image from "show image" to execute image install)
5) Toggle the boot partition to the new IB image, then verify that the image is installed and that the next boot setting points to the new IB image.
image boot next
show image
6) Save the changes made for new IB image
7) Activate the new IB image (reboot switch)