xCAT / Bugs / #4486 xcat fails to install cuda drivers for netboot install after os update

ralph bellofatto - 2014-12-19

We added the following package to the machine on which we run genimage:

apt-get install libpam-chroot

And then the cuda drivers succeeded in installing in the chroot environment.

So with the new OS update on Ubuntu, this package now appears to be required.

- ralph bellofatto - 2014-12-22
  
  It worked for a few tries, and now it is not working again...
  
  So something other than libpam-chroot managed to temporarily clear the
  problem. I'm not sure what that is yet...
  
  Ralph Bellofatto
  IBM TJ Watson Research
  1-914-945-3321
  ralphbel@us.ibm.com
  
  [bugs:#4486] xcat fails to install cuda drivers for netboot install after
  os update
  
  Status: open
  Milestone: 2.9.1
  Created: Thu Dec 18, 2014 08:31 PM UTC by ralph bellofatto
  Last Updated: Thu Dec 18, 2014 08:31 PM UTC
  Owner: nobody
  
  xcat fails to install cuda drivers for netboot install
  
  We had previously successfully built cuda drivers with the following
  script.
  
  proc_dir=
  sys_dir=
  
  # exit trap to undo any mounts done earlier
  function finish {
      set +e
      # quote the variables: with an unquoted empty value, [ -n ] is always true
      [ -n "$proc_dir" ] && umount "$proc_dir"
      [ -n "$sys_dir" ] && umount "$sys_dir"
  }
  trap finish EXIT
  
  export DEBIAN_FRONTEND=noninteractive
  unset DEBIAN_HAS_FRONTEND
  unset DEBCONF_REDIR
  unset DEBCONF_OLD_FD_BASE
  unset ARCH
  
  mount -o bind /sys "$installroot/sys" && sys_dir=$installroot/sys
  mount -o bind /proc "$installroot/proc" && proc_dir=$installroot/proc
  
  chroot "$installroot" \
      apt-get -q -y --force-yes -o Dpkg::Options::="--force-confold" install cuda
  
  After we updated our mirrors from Ubuntu to the latest kernel
  (3.16.0-28-generic) and other security updates, and also updated the
  underlying OS of the machine we run the installs on, we now get the
  following error:
  
  Adding system user `nvidia-persistenced' (UID 104) ...
  Adding new group `nvidia-persistenced' (GID 106) ...
  Adding new user `nvidia-persistenced' (UID 104) with group `nvidia-persistenced' ...
  chfn: PAM: System error
  adduser: `/usr/bin/chfn -f NVIDIA Persistence Daemon nvidia-persistenced'
  returned error code 1. Exiting.
  dpkg: error processing package nvidia-340 (--configure):
  subprocess installed post-installation script returned error exit status 1
  dpkg: dependency problems prevent configuration of nvidia-340-uvm:
  nvidia-340-uvm depends on nvidia-340 (>= 340.50); however:
  Package nvidia-340 is not configured yet.
  
  I attempted to isolate the problem by trying to just issue the chfn call
  under chroot and I get the following:
  
  /opt/xcat/share/xcat/netboot/ubuntu# chroot /install/netboot/ubuntu14.10/ppc64el/tulgpu-0000-netboot-compute/rootimg \
      /usr/bin/chfn -f "NVIDIA Persistence Daemon" nvidia-persistenced
  chfn: PAM: System error
  
  I suspect this is the same error that the cuda post-install script is
  running into. Something about the new OS and updates is now preventing
  chfn from executing in the chroot environment that is needed to install
  packages into the network boot image.
  
  Please advise how one can overcome this error for this and any other
  apt-get post-install step that we run into.
  
  The cuda drivers and development environment are a bit too time consuming
  to install every time that we boot.
  

Guang Cheng Li - 2014-12-22

It is great that you figured this out; closing this bug. We will need to document this when we add support for NVIDIA GPUs.


Guang Cheng Li - 2014-12-22

status: open --> wont-fix

assigned_to: Guang Cheng Li

component: unknown --> ubuntu

Guang Cheng Li - 2014-12-23

Searching Google for the error "chfn: PAM: System error" turns up many discussions of it in Docker configurations. The xCAT diskless genimage procedure is similar to Docker in that it runs the OS in a chroot environment. According to http://stackoverflow.com/questions/25193161/chfn-pam-system-error-intermittently-in-docker-hub-builds, the problem seems to be caused by some recent kernel updates. Someone there recommended the workaround "ln -s -f /bin/true /usr/bin/chfn"; could you give it a try?
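
A minimal sketch of applying that workaround to an image root from the build host, so that it can be undone once the packages are configured. The function names and the chfn.real backup name are illustrative and not part of xCAT. Because the link target is an absolute path, it resolves inside the chroot, so /bin/true must exist in the image.

```shell
#!/bin/sh
# Hypothetical helpers, not part of xCAT: stub out chfn in an image root so
# that package post-install scripts calling it succeed, then restore it.

# stub_chfn ROOT: keep the real binary as chfn.real and point chfn at /bin/true.
stub_chfn() {
    root=$1
    # Only back up a real file; if chfn is already a symlink it was stubbed before.
    if [ -e "$root/usr/bin/chfn" ] && [ ! -L "$root/usr/bin/chfn" ]; then
        mv "$root/usr/bin/chfn" "$root/usr/bin/chfn.real"
    fi
    # Same command as the suggested workaround, applied under ROOT.
    ln -s -f /bin/true "$root/usr/bin/chfn"
}

# restore_chfn ROOT: put the real binary back if a backup exists.
restore_chfn() {
    root=$1
    if [ -e "$root/usr/bin/chfn.real" ]; then
        rm -f "$root/usr/bin/chfn"
        mv "$root/usr/bin/chfn.real" "$root/usr/bin/chfn"
    fi
}
```

With the install script from the original report, this would be called as stub_chfn "$installroot" before the chroot apt-get step and restore_chfn "$installroot" after it. While the stub is in place chfn does nothing, so the full-name field that adduser asks for is not set on the new account.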

BTW, several bugs have already been opened against Debian/Ubuntu for this problem, for example:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=763391
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=745082


Guang Cheng Li - 2014-12-24

From Ralph Bellofatto:

The workaround for this problem works.

This leaves us with the problem that chfn did not do what it was supposed to do.

Pursuing a real fix for this via an LTC bug report is the next step. Can you file such an LTC bug report for this issue?


Mauricio Faria de Oliveira - 2015-01-27

The error actually happened because the chroot was mounted over NFSv4, and the NFSv4 server had incorrect domain name configuration.

The NFSv4 idmapd then failed to match 'localdomain' (server) with cluster.com (client), so the chfn binary (and others) appeared to be owned by nobody/nogroup. Combined with the suid bit on that binary, this caused the kernel to deny it during the PAM/audit check (the failure occurs right after the socket/sendto/recvfrom syscalls from PAM to kernel audit).

The solution was to configure the domain name correctly on the server.

[root@bgxcat mauricfo]# cat /etc/sysconfig/network
NETWORKING=yes
#HOSTNAME=bgxcat
HOSTNAME=bgxcat.cluster.com

Other possible workarounds were:
- Use NFSv3 (which has no Name-ID Mapping / idmapd)
- Clear the suid bit
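
The symptom described above (setuid binaries in the image appearing to be owned by nobody) can be checked from the machine that mounts the image. A minimal sketch; the function name and its arguments are hypothetical, and the owner is a parameter because the unmapped user may have a different name on a given system.

```shell
#!/bin/sh
# Hypothetical helper: list setuid files under ROOT that are owned by OWNER
# (a user name or numeric UID, default: nobody). An NFSv4 idmapd domain
# mismatch shows up as setuid binaries owned by the unmapped user.
# Usage: find_unmapped_suid ROOT [OWNER]
find_unmapped_suid() {
    root=$1
    owner=${2:-nobody}
    # -xdev: stay on one filesystem; -perm -4000: setuid bit is set
    find "$root" -xdev -type f -perm -4000 -user "$owner" -print
}
```

Running find_unmapped_suid against the rootimg directory should print nothing once the domain name is configured correctly. If /usr/bin/chfn is listed, the ownership mapping is still wrong; the "clear the suid bit" workaround corresponds to running chmod u-s on each listed file.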

More details at
https://bugs.launchpad.net/ubuntu/+source/shadow/+bug/1408589
