From: Larry B. <ba...@us...> - 2002-07-25 18:33:53
I have modified the bproc/beoboot node startup scripts to automatically
start the NFS RPC portmapper and status daemon so that an NFS mount
without the "nolock" option succeeds. I also fixed a couple annoyances
(i.e., tar failures), and added support for ext3 fstypes.

Locking support is required for the MPI-2 parallel IO routines (I have
MPICH 1.2.4). I am not sure yet that it is completely working; while
things have improved, a couple of the MPICH IO test routines still fail.
I hope to track that down soon.

I'm going to try to add a bit more user control of the file system setup
done in setup_fs through the /etc/beowulf/config configuration file.
For example, I'd like to add entries that list the names of directories
that get automatically created, such as /proc, /etc, /tmp, and /scratch.
These are hard-coded now. Also, I'd like to make support for the RPC
portmapper and status daemon optional (e.g., always, auto, never).
Finally, I'd like to get the contents for /etc/nsswitch.conf either from
a file in /etc/beowulf, or from /etc/beowulf/config.

Below are the files I have modified/use:

/etc/exports                    The NFS file systems exported by the master
/etc/beowulf/fstab              The file systems file for the nodes
/etc/beowulf/config             The bproc/beoboot configuration file
/etc/beowulf/node_up            The beoboot stub node startup script
/usr/lib/beoboot/bin/setup_fs   The beoboot node file system setup script

After rebooting, this is what /var/log/beowulf/node.0 looks like:

node_up: Setting system clock.
node_up: TODO set interface netmask.
node_up: Configuring loopback interface.
setup_fs: Configuring node filesystems...
setup_fs: Using /etc/beowulf/fstab
setup_fs: Checking 192.168.50.209:/bin (type=nfs)...
setup_fs: Mounting 192.168.50.209:/bin on /rootfs/bin... (type=nfs; options=ro,nolock,rsize=8192)
setup_fs: Checking 192.168.50.209:/home (type=nfs)...
setup_fs: Mounting 192.168.50.209:/home on /rootfs/home... (type=nfs; options=rw,rsize=8192,wsize=8192,noac)
setup_fs: Mount deferred until lock daemon running.
setup_fs: Checking 192.168.50.209:/opt (type=nfs)...
setup_fs: Mounting 192.168.50.209:/opt on /rootfs/opt... (type=nfs; options=ro,nolock,rsize=8192)
setup_fs: Checking 192.168.50.209:/sbin (type=nfs)...
setup_fs: Mounting 192.168.50.209:/sbin on /rootfs/sbin... (type=nfs; options=ro,nolock,rsize=8192)
setup_fs: Checking 192.168.50.209:/usr (type=nfs)...
setup_fs: Mounting 192.168.50.209:/usr on /rootfs/usr... (type=nfs; options=ro,nolock,rsize=8192)
setup_fs: Checking 192.168.50.209:/var/node.0 (type=nfs)...
setup_fs: Mounting 192.168.50.209:/var/node.0 on /rootfs/var... (type=nfs; options=rw,nolock,rsize=8192,wsize=8192)
setup_fs: Checking none (type=proc)...
setup_fs: Mounting none on /rootfs/proc... (type=proc; options=defaults)
setup_fs: Checking none (type=devpts)...
setup_fs: Mounting none on /rootfs/dev/pts... (type=devpts; options=gid=5,mode=620)
node_up: populating /dev and /etc
node_up: Copying over device nodes.
node_up: Copying over time zone info.
node_up: Copy over nsswitch info.
node_up: Node setup finished.
/etc/beowulf/node_up: Copy files into /etc for /etc/nsswitch.conf.
/etc/beowulf/node_up: Start the RPC portmapper and status daemon.
/etc/beowulf/node_up: Complete deferred network mounts.
/etc/beowulf/node_up: Soft link /tmp to /var/tmp.
/etc/beowulf/node_up: Soft link /scratch to /home/node.0.

Larry Baker
US Geological Survey
ba...@us...
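
The "always, auto, never" control proposed above for the RPC portmapper
and status daemon could look roughly like the sketch below. It is only an
illustration under stated assumptions: the function names want_lockd and
fstab_needs_lockd are hypothetical and not part of bproc/beoboot, and the
policy value would still have to come from /etc/beowulf/config under a tag
that does not exist yet. The "auto" case applies the same rule node_up uses
now: start the daemons only if some NFS entry lacks the "nolock" option.

```shell
#!/bin/sh
# Hypothetical sketch; none of these names exist in bproc/beoboot.

# Usage: fstab_needs_lockd fstab_file
# Succeeds if any NFS entry in the file lacks the "nolock" option.
fstab_needs_lockd() {
    while read device mountpt fstype options rest ; do
        # Skip blank lines and comments
        case "$device" in ""|\#*) continue ;; esac
        # Only NFS file systems need the lock daemons
        case "$fstype" in nfs*) ;; *) continue ;; esac
        # Match "nolock" as a whole comma-separated option
        case ",$options," in *,nolock,*) ;; *) return 0 ;; esac
    done < "$1"
    return 1
}

# Usage: want_lockd policy fstab_file
# policy is one of: always, auto, never (empty means auto).
want_lockd() {
    case "$1" in
    always)  return 0 ;;
    never)   return 1 ;;
    auto|"") fstab_needs_lockd "$2" ;;
    *)       echo "Invalid lock daemon policy '$1'" >&2 ; return 2 ;;
    esac
}
```

In node_up this might be called as "if want_lockd "$POLICY" /etc/beowulf/fstab ; then ... fi",
where POLICY is an assumed variable holding the configured value.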

#
# /etc/exports
#
# Read-only exports
#
/bin 192.168.50.209/255.255.255.224(ro)
/opt 192.168.50.209/255.255.255.224(ro)
/sbin 192.168.50.209/255.255.255.224(ro)
/usr 192.168.50.209/255.255.255.224(ro)
#
# Private read-write exports
#
/var/node.0 192.168.50.210(rw,no_root_squash)
/var/node.1 192.168.50.211(rw,no_root_squash)
#
# Shared read-write exports (MPICH 1.2.4, section 4.11.1: use "noac")
#
/home 130.118.45.45/255.255.252.0(rw) \
      192.168.50.209/255.255.255.224(rw,no_root_squash)

#
# /etc/beowulf/fstab
#
# This file is the fstab for nodes.
# One difference is that we allow for shell variable expansions...
#
# Variables that will get substituted:
# MASTER = IP address of the master node. (good for doing NFS mounts)
# NODE = slave's node no.
# RAMDISK = device name (/dev/<ramdev>) of a device suitable for a root fs
#
# A cooked version (with variable substitution) of this file will be copied
# to /etc/fstab on the slave node.
#
# The root file system is a tmpfs provided by the boot scripts. You
# can mount something on / if you'd like but due to oddities in the file
# caching code it's not recommended right now.

# This is the default setup from beofdisk, once you setup your disks.
#/dev/hda2 swap swap defaults 0 0
#/dev/hda3 / ext2 defaults 0 0

# These should always be added
none /proc proc defaults 0 0
none /dev/pts devpts gid=5,mode=620 0 0

# NFS (for example and default friendliness)
# Note: Mounts without the "nolock" option are deferred until the RPC portmapper
# and status daemons are running -- see the instructions in /etc/beowulf/node_up
#
# Read-only mount points
#
$MASTER:/bin /bin nfs ro,nolock,rsize=8192 0 0
$MASTER:/opt /opt nfs ro,nolock,rsize=8192 0 0
$MASTER:/sbin /sbin nfs ro,nolock,rsize=8192 0 0
$MASTER:/usr /usr nfs ro,nolock,rsize=8192 0 0
#
# Private read-write mount points
#
$MASTER:/var/node.$NODE /var nfs rw,nolock,rsize=8192,wsize=8192 0 0
#
# Shared read-write mount points (MPICH 1.2.4, section 4.11.1: use "noac")
#
$MASTER:/home /home nfs rw,rsize=8192,wsize=8192,noac 0 0

#
# /etc/beowulf/config
#
# Sample Beowulf Configuration file
#
# $Id: config,v 1.7 2002/03/12 20:54:58 hendriks Exp $
#
#
# Default cluster configuration (uses eth1, and 192.168.1.0/24)
# interface: internal cluster interface (the one connected to the nodes)
#
# iprange: range of IP addresses for nodes.
interface eth1 192.168.50.209 255.255.255.224

# Setup addresses in the cluster. The "nodes" line is REQUIRED here to specify
# cluster size. "iprange" and "ip" assign addresses to nodes. The "0" in
# iprange here tells it to start assigning at node zero.
nodes 2
iprange 0 192.168.50.210 192.168.50.211

# Default libraries (These are the libraries which will automagically be made
# available to the slaves.)
# No line continuation, multiple entries allowed
libraries /lib /usr/lib /usr/X11R6/lib
libraries /opt/intel/compiler60/ia32/lib /opt/intel/mkl/lib/32

# Default file system policies.
fsck full
mkfs if_needed

# Default location of boot images
bootfile /var/beowulf/boot.img
kernelimage /boot/vmlinuz-2.4.18-lanl.16
kernelcommandline apm=power-off

# Here we assign MAC addresses to nodes. Nodes can have multiple MAC
# addresses. Here the optional "0" zero argument states that the address
# should be assigned to node zero. Node lines following that will assign
# addresses to nodes sequentially

# D-Link DFE-500TX PCI card (DEC 21140-A chip)
#node 0 00:40:05:36:66:83
#node 00:40:05:40:60:e7

# Onboard (RealTek RTL8100BL chip)
node 0 00:40:63:c0:5e:08
node 00:40:63:c0:5f:b4

#!/bin/sh
#
# /etc/beowulf/node_up
#
# This shell script is called automatically by BProc to perform any
# steps necessary to bring up the nodes. This is just a stub script
# pointing to the real script

NODE=$1
MASTER=`bpstat -a master`
BINDIR=/usr/lib/beoboot/bin
PATH=/sbin:/usr/sbin:$PATH:$BINDIR

# Standard location of statd database files
#SMDIR=/var/lib/nfs
# Location of statd database files on Red Hat Linux
SMDIR=/var/lib/nfs/statd

$BINDIR/node_up $* || exit 1

# At this point, all file systems in $NODE:/etc/fstab have been mounted,
# except for network devices (host:export) without the "nolock" option.
# The following sections finish preparing the node for "mount -a", below.
# (Currently, only the RPC portmapper and status daemon are started, if
# necessary, for NFS file systems (fstype=nfs). Other fstypes may require
# similar preparation.)

# NFS devices without the "nolock" option require the RPC portmapper and
# status daemons. The status daemon requires read/write access to the
# /var/lib/nfs/statd/sm and .../sm.bak directories, which must exist and
# be owned 700 by rpcuser (see http://nfs.sourceforge.net, item 17).

# True if there are any NFS mounts in $NODE:/etc/fstab without the "nolock"
# option, i.e., that need the RPC portmapper and status daemon.
if [ `bpsh -n $NODE cat /etc/fstab | \
      while read line ; do
          if [ -n "${line}" -a "${line:0:1}" != "#" ] ; then
              echo "${line}" | ( \
                  read device mountpt fstype options rest && \
                  echo ${fstype} | grep -q "nfs" && \
                  echo ${options} | grep -q -v "nolock" \
              ) && echo "${line}"
          fi
      done | \
      wc -l` -gt 0 ] ; then

    # Copy the files needed for the Name Service Switch (NSS) to /etc
    # (needed by getpwnam(), etc., in #include <pwd.h>, called by rpc.statd)
    echo "/etc/beowulf/node_up: Copy files into /etc for /etc/nsswitch.conf."
    bpcp /etc/passwd $NODE:/etc
    bpcp /etc/group $NODE:/etc
    bpcp /etc/rpc $NODE:/etc

    # Replace the NSS config file
    cat << EOF | bpsh -n $NODE --stdout /etc/nsswitch.conf cat
#
# /etc/nsswitch.conf
#
hosts: bproc
passwd: bproc files
group: bproc files
rpc: files
EOF

    # Create /var/lib/nfs/statd/sm and .../sm.bak owned 700 by rpcuser (Red Hat)
    bpsh -n $NODE mkdir -p $SMDIR/sm
    bpsh -n $NODE chmod 700 $SMDIR/sm
    bpsh -n $NODE mkdir -p $SMDIR/sm.bak
    bpsh -n $NODE chmod 700 $SMDIR/sm.bak
    if echo $SMDIR | grep -q "/statd" ; then
        bpsh -n $NODE chmod 700 $SMDIR
        bpsh -n $NODE chown rpcuser $SMDIR
        bpsh -n $NODE chgrp rpcuser $SMDIR
        bpsh -n $NODE chown rpcuser $SMDIR/sm
        bpsh -n $NODE chgrp rpcuser $SMDIR/sm
        bpsh -n $NODE chown rpcuser $SMDIR/sm.bak
        bpsh -n $NODE chgrp rpcuser $SMDIR/sm.bak
    fi

    # Start the RPC portmapper and status daemon
    echo "/etc/beowulf/node_up: Start the RPC portmapper and status daemon."
    bpsh -n $NODE initlog -c portmap
    bpsh -n $NODE initlog -c rpc.statd
fi

# Mount the network devices that were deferred earlier
echo "/etc/beowulf/node_up: Complete deferred network mounts."
bpsh -n $NODE mount -a

##### Add commands here to complete the setup of the node #####

# Soft link /tmp to /var/tmp (NFS /var must be no_root_squash)
echo "/etc/beowulf/node_up: Soft link /tmp to /var/tmp."
bpsh -n $NODE rmdir --ignore-fail-on-non-empty /tmp
bpsh -n $NODE mkdir -p /var/tmp
bpsh -n $NODE ln -s /var/tmp /tmp
bpsh -n $NODE chmod 1777 /var/tmp
# Clean out /tmp every boot
bpsh -n $NODE /bin/rm -r -f /var/tmp/*
bpsh -n $NODE /bin/rm -r -f /var/tmp/.* 2>/dev/null

# Soft link /scratch to /home/node.$NODE (NFS /home must be no_root_squash)
echo "/etc/beowulf/node_up: Soft link /scratch to /home/node.$NODE."
bpsh -n $NODE rmdir --ignore-fail-on-non-empty /scratch
bpsh -n $NODE mkdir -p /home/node.$NODE
bpsh -n $NODE ln -s /home/node.$NODE /scratch
bpsh -n $NODE chmod 1777 /home/node.$NODE

exit 0

#!/bin/sh
#
# /usr/lib/beoboot/bin/setup_fs
#
# Erik Hendriks <hen...@la...>
#
# $Id: setup_fs,v 1.4 2001/11/30 17:52:40 hendriks Exp $
#
# This bit of code is a first stab at understanding fstab for mount.
# It's a lot like mount dealing with its own fstab.
# Differences with just allowing mount to chew on an fstab:
# We can do fsck checks before attempting to mount.
# We can (re)create file systems before mounting.
# We can create mount points before mounting.
#
#--------------------------------------------------------------------------
# Useful functions
#--------------------------------------------------------------------------
# Usage: do_safefsck node device fstype
do_safefsck() {
    case $2 in
    /dev/ram*)
        echo "setup_fs: Hmmm...This appears to be a ramdisk. "
        echo -n "setup_fs: I'm going to try checking the "
        echo "filesystem (fsck) anyway."
        echo -n "setup_fs: If it is a RAM disk the following will "
        echo "fail harmlessly."
        ;;
    esac
    case $3 in
    ext*) bpsh -n $1 e2fsck -p $2 ; ret=$?
          if [ "$ret" = 1 ] ; then ret=0; fi
          ;;
    swap) bpsh -n $1 chkswap $2 ; ret=$? ;;
    *)    ret=0;;
    esac
    [ "$ret" = 0 ]
}

# Usage: do_fsck node device fstype
do_fsck() {
    echo "setup_fs: Checking $2 (type=$3)..."
    case $2 in
    /dev/ram*)
        echo "setup_fs: Hmmm...This appears to be a ramdisk. "
        echo -n "setup_fs: I'm going to try checking the "
        echo "filesystem (fsck) anyway."
        echo -n "setup_fs: If it is a RAM disk the following will "
        echo "fail harmlessly."
        ;;
    esac
    case $3 in
    ext*) bpsh -n $1 e2fsck -y $2 ; ret=$?
          if [ "$ret" = 1 ] ; then ret=0; fi
          ;;
    swap) bpsh -n $1 chkswap $2 ; ret=$? ;;
    *)    ret=0;;
    esac
    [ "$ret" = 0 ]
}

# Usage: do_mkfs node device fstype fssize
do_mkfs() {
    echo "setup_fs: Creating $3 on $2..."
    case $3 in
    ext2) bpsh -n $1 mke2fs -q $2 $4 ; ret=$? ;;
    ext3) bpsh -n $1 mke2fs -q -j $2 $4 ; ret=$? ;;
    swap) bpsh -n $1 mkswap $2 $4 ; ret=$? ;;
    *)    ret=0;;
    esac
    [ "$ret" = 0 ]
}

# Usage: load_fs node fstype
load_fs () {
    if [ -z "`bpsh -n $1 grep $2 /proc/filesystems`" ] ; then
        modprobe --node $1 $2
    fi
}

# Usage: do_mount node device mountpt fstype options
do_mount() {
    # Load file system module for all fstypes so they can be mounted later
    if [ "$4" != "swap" ] ; then
        load_fs $1 $4
    fi
    # Don't mount devices with the "noauto" option
    if [ -n "`echo $5 | grep noauto`" ] ; then
        return
    fi
    echo "setup_fs: Mounting $2 on $3... (type=$4; options=$5)"
    case $4 in
    swap) bpsh -n $1 swapon $2 ;;
    # Defer mounts of network devices (host:export) without the "nolock" option
    *)  if [ -z "`echo $2 | grep :`" -o \
             -n "`echo $5 | grep nolock`" ] ; then
            if bpsh -n $1 mount -nt $4 -o $5 $2 $3 ; then
                if [ "${mountpt:0:1}" == "/" ] ; then
                    echo "$device $mountpt $fstype $options" >> $MTABFILE
                fi
            fi
        else
            echo "setup_fs: Mount deferred until lock daemon running."
        fi
        ;;
    esac
}

# Usage: beoconfig tag [config_file]
beoconfig() {
    local FILE=$2
    if [ -z "$FILE" ] ; then FILE=${CONFIG} ; fi
    if [ ! -f ${FILE} ] ; then
        echo "Warning: ${FILE} file not found." >&2
        return
    fi
    # These sed bits:
    # - strip comments
    # - strip leading + trailing space
    # - if line starts with $1, strip off $1 and print it.
    sed -ne "s/#.*//" < ${FILE} \
        -e "s/^[[:space:]]\+//;s/[[:space:]]\+\$//" \
        -e "/^$1[[:space:]]/{s/^$1[[:space:]]\+//;p;}"
}

#--------------------------------------------------------------------------
# Argument sanity checking
if [ "$1" = "" ] ; then
    echo "Usage: setup_fs <nodenumber>"
    exit 1
fi

echo "setup_fs: Configuring node filesystems..."
NODE=$1
PATH=/sbin:/usr/sbin:$PATH:/usr/lib/beoboot/bin
CONFIG=/etc/beowulf/config
MASTER=`bpstat -a master`
RAMDISK=/dev/ram3
FSCK=`beoconfig fsck`
MKFS=`beoconfig mkfs`

# Select which FSTAB to use.
FSTAB=/etc/beowulf/fstab.$NODE
if [ ! -r $FSTAB ] ; then
    FSTAB=/etc/beowulf/fstab
fi
echo "setup_fs: Using $FSTAB"

# XXX We need a way to pick up per-node commands!

# Control flags
#
# FSCK =
# 0 = Don't touch anything, just try to mount.
# 1 = Ok to fsck but don't do anything if it fails.
# 2 = fsck and do mkfs if it fails.
# 3 = skip fsck go straight to mkfs
#
# Sanity check FSCK (default = 1)
case $FSCK in
"never"|"safe"|"full") ;;
"") FSCK=safe ;;
*)  echo 1>&2 "Invalid value '$FSCK' for fsck tag in $CONFIG."
    exit 1 ;;
esac
case $MKFS in
"never"|"if_needed"|"always") ;;
"") MKFS=if_needed ;;
*)  echo 1>&2 "Invalid value '$MKFS' for mkfs tag in $CONFIG."
    exit 1 ;;
esac

if [ ! -f $FSTAB ] ; then
    echo 1>&2 "setup_fs: $FSTAB (file system table) is missing."
    exit 1
fi

# Ok... This is one big nasty pipe line... Here's what this mess does:
# * Use sed to remove comments. (starting with #)
# * Run it all through eval to do variable substitutions.
# * Go through all the lines doing:
#   + Ignore the empty lines
#   + Remove trailing slashes from the mount points
#   + Prepend a number that will allow us to sort the mount points.
# * Sort the mount points
# * On each mount point (depending on the FSCK policy):
#   + fsck the file system
#   + if bad, possibly recreate the file system.
#   + mount the file system (defer network mounts w/o the "nolock" option)
# * Create /etc/fstab for the new node.
# * Create /etc/mtab for the new node.
MTABFILE=/tmp/.setup_fs.mtab.$$
if ! rm -f $MTABFILE ; then
    echo 1>&2 "setup_fs: $MTABFILE already exists and can't remove."
    exit 1
fi
touch $MTABFILE
FSTABFILE=/tmp/.setup_fs.fstab.$$
if ! rm -f $FSTABFILE ; then
    echo 1>&2 "setup_fs: $FSTABFILE already exists and can't remove."
    exit 1
fi
touch $FSTABFILE

cat $FSTAB | \
while read line ; do
    if [ -z "$line" -o "${line:0:1}" = "#" ] ; then
        echo $line >>$FSTABFILE
    else
        line=`eval echo "$line"`
        echo $line >>$FSTABFILE
        echo $line
    fi
done | \
while read device mountpt fstype options junk ; do
    if [ -z "$options" ] ; then
        echo 1>&2 "Ignoring incomplete line: $device $mountpt $fstype $options $junk"
        continue
    fi
    # Sanitize mount point... (squeeze multiple slashes, remove
    # any trailing slashes)
    mountpt=`echo $mountpt | sed -e 's!/\+!/!g' -e 's!/\+$!!'`
    slashct=`echo $mountpt | tr -cd / | wc -c`
    if [ -z $mountpt ] ; then mountpt=/ ; fi
    echo $slashct $device $mountpt $fstype $options
done | \
sort -n | \
(while read slashct device mountpt fstype options junk ; do
    if [ -z "$options" ] ; then
        if [ -n "$device" ] ; then
            echo 1>&2 "Ignoring incomplete line: $device $mountpt $fstype $options $junk"
        fi
        continue
    fi

    # Get a file system size option if it's there...
    fssize=`echo $options | sed -e 's/.*fs_size=\([0-9]\+\).*/\1/p;d'`
    options=`echo $options | sed -e 's/fs_size=[0-9]\+//g'`
    if [ -z "$options" ] ; then options=defaults; fi

    # Everything gets a "/rootfs" prefix at this stage. Also we create the
    # mount points as needed. This requires that people have their fstab
    # in some reasonable order. (It might be hard for us to sort it....)

    # see to it that the device node exists on the remote machine
    if [ "${device:0:4}" == "/dev" ] ; then
        (cd / ; tar cf - $device) | bpsh -n $NODE tar xf -
    fi

    mknewfs=0
    if [ $MKFS = "always" ]; then
        mknewfs=1
    else
        case $FSCK in
        "never") ;; # No FSCK!
        "safe")
            if ! do_safefsck $NODE $device $fstype ; then
                echo 1>&2 "setup_fs: RAM disks fail FSCK, that's OK"
                echo 1>&2 "setup_fs: FSCK failure. (OK for RAM disks)"
                mknewfs=1
            fi ;;
        "full")
            if ! do_fsck $NODE $device $fstype ; then
                echo 1>&2 "setup_fs: FSCK failure. (OK for RAM disks)"
                mknewfs=1
            fi ;;
        esac
    fi

    if [ $MKFS != "never" -a "$mknewfs" = 1 ] ; then
        if ! do_mkfs $NODE $device $fstype $fssize ; then
            echo 1>&2 "Failed to create $fstype file system on $device."
            exit 1
        fi
    fi

    # See to it that the mount point exists before trying to mount.
    if echo $mountpt | grep -q '^/' ; then
        if ! bpsh -n $NODE mkdir -p /rootfs$mountpt ; then
            echo 1>&2 "Failed to create $mountpt."
            exit 1
        fi
    fi
    if ! do_mount $NODE $device /rootfs$mountpt $fstype $options ; then
        echo 1>&2 "Failed to mount $device on $mountpt."
        exit 1
    fi
done

# Create fstab on the remote node...
if ! bpsh -n $NODE mkdir -p /rootfs/etc ; then
    echo 1>&2 "Failed to create /etc."
    exit 1
fi
if ! bpcp $FSTABFILE $NODE:/rootfs/etc/fstab ; then
    echo 1>&2 "Failed to create /etc/fstab."
    exit 1
fi
rm -f $FSTABFILE

# Finally, create mtab on the remote node...
if ! bpcp $MTABFILE $NODE:/rootfs/etc/mtab ; then
    echo 1>&2 "Failed to create /etc/mtab."
    exit 1
fi
rm -f $MTABFILE
) # Exit with status of this nutty pipeline.
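
The mount-ordering step in setup_fs (prefix each entry with the number of
slashes in its cleaned-up mount point, then sort numerically so that parents
are mounted before children) can be tried in isolation. The following is a
minimal sketch, not the script's own code: sort_by_depth is a hypothetical
helper name, and like setup_fs it assumes GNU sed for the \+ operator.

```shell
#!/bin/sh
# Hypothetical helper illustrating the slash-count sort used in setup_fs.

# Usage: sort_by_depth < fstab-style lines ("device mountpt fstype options")
# Prints the same entries, shallowest mount point first.
sort_by_depth() {
    while read device mountpt rest ; do
        # Skip blank or incomplete lines
        [ -n "$mountpt" ] || continue
        # Squeeze repeated slashes and drop trailing ones ("/" becomes "")
        mountpt=`echo "$mountpt" | sed -e 's!/\+!/!g' -e 's!/\+$!!'`
        # Depth is the number of slashes left in the mount point
        slashct=`echo "$mountpt" | tr -cd / | wc -c`
        # An empty result means the root file system
        [ -n "$mountpt" ] || mountpt=/
        echo $slashct $device $mountpt $rest
    done | \
    sort -n | \
    while read slashct line ; do
        # Drop the sort key again
        echo "$line"
    done
}
```

For example, feeding it entries for /usr/local, /dev/pts/, / and //usr puts
the root entry first and the /usr entry second; entries of equal depth come
out in whatever order sort gives them.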