
Daniel Povey, Carl Pupa

Customizing your Image (Phase 2)

We'll assume that the AMI you just created is now ready. Execute the command below, replacing the AMI-ID with the one you just made:

ec2run ami-8eaf37e7 -g mycluster -k mycluster -z us-east-1c -t c3.large \
  -b "/dev/xvdb=ephemeral0"  -b "/dev/xvdc=ephemeral1"

Here, the -b option makes it attach the local storage ephemeral0 of the machine as device /dev/xvdb. When we create an image from this machine, this specification becomes part of the AMI and it will happen automatically. The command above will print out the instance-id (i-something) of the instance, along with other information. It is good practice to tag the instance with a meaningful name:

ec2tag i-b7a975d8 --tag Name=customizing_phase_2
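
Once you are able to log in (described next), you can verify on the instance that the ephemeral disks really were attached; the device names below are just the ones given to the -b flags above:

cat /proc/partitions   # xvdb and xvdc should appear in the list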

Type ec2din i-b7a975d8 (but use your new instance-id) to get the internet address (FQDN) of the instance, and ssh into it as the admin user (we cannot yet log in as root): something like,

ssh -i ~/.ssh/mycluster.pem admin@ec2-54-234-152-143.compute-1.amazonaws.com

This will hang or say "connection refused" until your machine is up, but retry and you should be able to get to it after a minute or so. This works even though we previously deleted this key from the authorized_keys on the image, because the start-up scripts on the image add it back.
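
If you would rather not retry by hand, a small loop like the following (using the example address from above) will poll until sshd comes up:

# Keep trying every 10 seconds until ssh succeeds; StrictHostKeyChecking=no
# avoids blocking on the first-connection host-key prompt.
until ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no -i ~/.ssh/mycluster.pem \
    admin@ec2-54-234-152-143.compute-1.amazonaws.com true; do
  sleep 10
done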

If you were able to ssh to the instance, then nothing serious went wrong. This means you probably don't need the previous instance you made anymore. If you can't remember its instance-id, ec2din with no arguments will help you find it. Then terminate the old instance: from your local machine, type something like the following.

# ec2kill i-7877a019 
INSTANCE    i-7877a019  stopped terminated

Further configuration...

OK, now we're ready to continue configuring the image. On the new instance, type:

sudo su
apt-get install autofs nfs-kernel-server lockfile-progs lvm2 curl -y

Some of the next installs are a little finicky with regard to the hostname of the machine they are installed on, so we'll set the instance's hostname at this point.

hostname master
echo master > /etc/hostname

Now edit the file /etc/hosts so the first line looks like this:

127.0.0.1       master localhost localhost.localdomain
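
If you prefer to script this edit, one way (a sketch; do check the result afterwards) is to rewrite the first line with sed:

sed -i '1s/.*/127.0.0.1       master localhost localhost.localdomain/' /etc/hosts
head -1 /etc/hosts   # verify the line looks right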

The next thing to do is to install NIS. The first line below is just to make it harder for you to continue blindly if you failed to set the hostname properly.

[ `hostname` != master ] && exit
echo ypserver master > /etc/yp.conf
apt-get install nis -y

This is one of those Debian interactive installs that brings up what looks like a DOS menu on your screen (if your terminal is powerful enough). It will ask you for the NIS domain; you can leave this at the value master. It will query about a conflict with the file /etc/yp.conf and ask you what you want to do. Choose N, which is to keep our version. We only set yp.conf to stop the installation process from hanging for an annoyingly long time (it will still hang for a while). The installation will look like it failed:

Setting NIS domainname to: kluster.
Starting NIS services: ypbindbinding to YP server...........................................failed (backgrounded).
. ok
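
Incidentally, if you ever want to script this install, the NIS domain question can be preseeded so it runs unattended; a sketch, assuming the package's debconf key is nis/domain, with --force-confold playing the role of answering N to the yp.conf question:

# Assumed debconf key; the domain value matches the one chosen above.
echo 'nis nis/domain string master' | debconf-set-selections
DEBIAN_FRONTEND=noninteractive apt-get install -y \
    -o Dpkg::Options::=--force-confold nis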

The next installation is interactive:

apt-get install gridengine-client -y

To the question "Install SGE Automatically?" reply "Yes". The "SGE cell name" should be left as "default", and the "SGE master hostname" should be set to "master". Next we install the "gridengine master" package. This is the queue manager, and it will run just on the "master" node, not on the regular nodes:

apt-get install gridengine-master -y

We next install the "execution client" part of GridEngine:

apt-get install gridengine-exec -y
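
The SGE questions can likewise be preseeded for unattended installs; a sketch, assuming the debconf keys shared/gridenginemaster, shared/gridenginecell and shared/gridengineconfig used by the Debian gridengine packages:

# Assumed debconf keys; verify with: debconf-get-selections | grep gridengine
echo 'gridengine-master shared/gridengineconfig boolean true'   | debconf-set-selections
echo 'gridengine-master shared/gridenginecell   string default' | debconf-set-selections
echo 'gridengine-master shared/gridenginemaster string master'  | debconf-set-selections
DEBIAN_FRONTEND=noninteractive apt-get install -y \
    gridengine-client gridengine-master gridengine-exec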

The following packages were things I needed for various reasons; you may find it useful to install them all now, in case you later need one of them.

apt-get install -y gawk automake1.10 libtool zlib1g-dev gfortran screen ntp \
   sudo rsync pkg-config gdb iftop libxml-simple-perl subversion \
   libatlas3-base g++ patch bzip2

Enable ssh login as root

Next we need to set up one more thing so that we can ssh in as root. On startup, Debian 7 inserts a command= option into the /root/.ssh/authorized_keys entry, which disables root login:

# cat /root/.ssh/authorized_keys
no-port-forwarding,no-agent-forwarding,no-X11-forwarding,command="echo 'Please login as the user \"admin\" rather than the user \"root\".';echo;sleep 10" ssh-rsa <snip long rsa entry>

We need to remove the ,command="echo 'Please login as the user \"admin\" rather than the user \"root\".';echo;sleep 10" part, so that it looks like:

no-port-forwarding,no-agent-forwarding,no-X11-forwarding ssh-rsa <snip long rsa entry>

This can be done manually, or from the command line:

sed -i 's/,command=.*\bssh-rsa\b/ ssh-rsa/g' /root/.ssh/authorized_keys
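
Either way, you can confirm that the restriction is gone; the following should print nothing:

grep 'command=' /root/.ssh/authorized_keys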

Once /root/.ssh/authorized_keys has been edited, restart the ssh service:

service ssh restart

Next, from the instance, type

ssh master

just to verify that we can still ssh to ourselves as root without a password. You should get a prompt; just type "exit" to go back to your original session. If there is an error, you'll have to figure out what went wrong.

Then, from a separate window on your local machine, verify that you can now ssh to the instance as root: something like

ssh -i ~/.ssh/mycluster.pem root@ec2-54-235-5-54.compute-1.amazonaws.com

Push scripts from local machine to image

Now you are ready to transfer a large number of config files and scripts from your "kluster" distribution on the local machine to the instance. (If you want to see what kinds of configuration changes are taking place, look at the files listed in scripts/root/config_files.) Run the following from your local machine, in the kluster directory, using the internet name of your actual instance:

bin/push-configs.sh ~/.ssh/mycluster.pem ec2-54-234-152-143.compute-1.amazonaws.com \
  `cat scripts/root/config_files`

Complete service setup

We now have to finish a few things before creating the image. First we have to initialize the NIS database. Do as follows on the instance:

shadowconfig off
cd /var/yp
/usr/lib/yp/ypinit -m

The ypinit command requires user interaction; you have to press Ctrl-D and then y. There will be some harmless warnings like "failed to send 'clear' to local ypserv: RPC: Program not registered". Next, run service nis restart:

# service nis restart
Stopping NIS services: ypbind ypserv yppasswdd ypxfrd.
Starting NIS services: ypserv yppasswdd ypxfrd ypbind.

If the output does not look like above, check that your /etc/hosts file has a line like 127.0.0.1 master localhost localhost.localdomain, that the command hostname prints out master, and that /etc/hostname says master.

Next, execute the following commands; this ensures that the user and group information will be propagated from the master via NIS.

echo '+::::::' >> /etc/passwd
echo "+:" >> /etc/group

We added some init scripts in /etc/init.d, so we need to register them with Debian as follows:

insserv -d kluster-misc-tasks kluster-mktemp kluster-set-hostname \
     mem-killer gridengine-exec kluster-configure-queue

This should not produce any output. It sets up soft links to the init scripts in /etc/init.d/, from the directories /etc/rcN.d for different runlevels. The directory for runlevel 4 (normal startup) should look as follows; this lets you know what order things will be started up in:

# ls /etc/rc4.d/
README                   S12rpcbind
S01bootlogs              S13nfs-common
S01cloud-init-local      S13nis
S01motd                  S14autofs
S01rsyslog               S14nfs-kernel-server
S01sudo                  S15cron
S02cloud-init            S15gridengine-master
S02dbus                  S16kluster-configure-queue
S02exim4                 S17kluster-misc-tasks
S02mem-killer            S18gridengine-exec
S02ntp                   S19cloud-final
S02rsync                 S19kluster-mktemp
S03cloud-config          S19rc.local
S03ssh                   S19rmnologin
S04kluster-set-hostname

(Note: the insserv command uses the "init info" in the comments at the top of the scripts in /etc/init.d to work out the dependencies between init jobs, which determines the order).
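
For reference, such a header looks like the following; the field values here are illustrative, not copied from the real kluster scripts:

### BEGIN INIT INFO
# Provides:          kluster-mktemp
# Required-Start:    $local_fs $remote_fs
# Required-Stop:
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Set up temporary directories for kluster
### END INIT INFO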

At this point we can check that a few things are working before we shut down and make the image. First we check that NIS (which used to be called Yellow Pages/YP) is working:

# ypcat -k auto.master
/export auto.export     -rw,nfsvers=3,intr,rsize=8192,wsize=8192,timeo=1000,retrans=5,bg,retry=5,proto=tcp,actimeo=10
/home auto.home       -rw,nfsvers=3,intr,rsize=8192,wsize=8192,timeo=1000,retrans=5,bg,retry=5,proto=tcp,actimeo=10

Next we check that GridEngine is working OK:

# service gridengine-master restart
Restarting Sun Grid Engine Master Scheduler: sge_qmaster.
# service gridengine-exec restart
Restarting Sun Grid Engine Execution Daemon: sge_execd.
# qhost -q
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
master                  lx26-amd64      2  0.02    7.5G  280.8M    2.9G     0.0
#

If anything went wrong with NIS or GridEngine, check that the hostname and /etc/hosts are correct. Unfortunately, there are many other things that can go wrong. With GridEngine in particular, if the initial installation goes wrong (e.g. because /etc/hosts or the hostname was wrong at the time of installation), in my experience the only way to fix it is to start from scratch with an image that has never had GridEngine installed on it.
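
A few quick read-only checks can help narrow down where a problem is; these commands come with the packages installed earlier:

hostname               # should print: master
ypwhich                # should print the NIS server we are bound to, i.e. master
rpcinfo -p | grep yp   # ypserv/ypbind should be registered with the portmapper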

Next, we need to make some configuration changes in the queue. To make this easier I previously saved some configuration information from my own queue setup (see /root/queue/README for more info). On the instance, do:

cd /root/queue
qconf -as master
( echo 'group_name @allhosts'; echo "hostlist `qconf -sel`" ) > foo
qconf -Ahgrp foo
qconf -Ap sp_smp
qconf -Aq sq_all.q 
qconf -Mc sc
qconf -Msconf ssconf
cp sconf global; qconf -Mconf global; rm global

These commands set various configuration parameters of the queue to values that I generally work with and that should work well for Kaldi system building. If you are going to administer a GridEngine cluster, you should probably become familiar with GridEngine administration. Commands and associated options that you will likely use a lot include qstat, qhost -q, qconf -mq, qconf -ah, qconf -dh, qconf -ae, qconf -de, qconf -mc, and qconf -mconf. Type man qconf for more information.
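
For example, to take a read-only look at what the commands above actually loaded (the queue name all.q is my assumption, based on the sq_all.q file name):

qconf -shgrpl     # list host groups; should include @allhosts
qconf -sql        # list cluster queues
qconf -sq all.q   # show the configuration of the queue we just added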

Create image

Now we will create an image from the instance. From your local machine, stop the instance:

# ec2stop  i-b7a975d8 
INSTANCE    i-b7a975d8  running stopping

Now create an AMI from the stopped instance:

# ec2cim  i-b7a975d8  -n 'customized_phase_2_try1'
IMAGE   ami-74128a1d
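
As with instances, it can be helpful to tag the new AMI with a meaningful name (ec2tag accepts AMI-IDs as well); use your own AMI-ID:

ec2tag ami-74128a1d --tag Name=customized_phase_2_try1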

Previous: Customize your image (Phase 1)
Next: Spawning the master node
Up: Kluster Wiki

