
#4486 xcat fails to install cuda drivers for netboot install after os update

2.9.1
wont-fix
None
ubuntu
8
2015-01-27
2014-12-18
No

xcat fails to install cuda drivers for netboot install

We had previously built the cuda drivers successfully with the following script.

proc_dir=
sys_dir=
# exit trap to undo any mounts done earlier
function finish {
    set +e;
    # Quote the variables: with an unquoted empty value, [ -n ] is always true.
    [ -n "$proc_dir" ] && umount "$proc_dir";
    [ -n "$sys_dir" ] && umount "$sys_dir";
} 
trap finish EXIT

export DEBIAN_FRONTEND=noninteractive
unset  DEBIAN_HAS_FRONTEND
unset  DEBCONF_REDIR
unset  DEBCONF_OLD_FD_BASE
unset  ARCH

mount -o bind /sys $installroot/sys && sys_dir=$installroot/sys
mount -o bind /proc $installroot/proc && proc_dir=$installroot/proc

chroot $installroot \
  apt-get -q -y --force-yes  -o Dpkg::Options::="--force-confold" install  cuda

After we updated our source code mirrors to the latest Ubuntu kernel version
(3.16.0-28-generic, plus other security updates) and updated the underlying
OS of the computer we run the installs on, we now get the following error:

Adding system user `nvidia-persistenced' (UID 104) ...
Adding new group `nvidia-persistenced' (GID 106) ...
Adding new user `nvidia-persistenced' (UID 104) with group `nvidia-persistenced' ...
chfn: PAM: System error
adduser: `/usr/bin/chfn -f NVIDIA Persistence Daemon nvidia-persistenced' returned error code 1. Exiting.
dpkg: error processing package nvidia-340 (--configure):
subprocess installed post-installation script returned error exit status 1
dpkg: dependency problems prevent configuration of nvidia-340-uvm:
nvidia-340-uvm depends on nvidia-340 (>= 340.50); however:
Package nvidia-340 is not configured yet.

I attempted to isolate the problem by issuing just the chfn call under
chroot, and I get the following:

/opt/xcat/share/xcat/netboot/ubuntu# chroot /install/netboot/ubuntu14.10/ppc64el/tulgpu-0000-netboot-compute/rootimg   /usr/bin/chfn -f "NVIDIA Persistence Daemon" nvidia-persistenced
chfn: PAM: System error

I suspect this is the same error that the cuda post-install script is
running into. Something about the new OS and updates is now preventing
chfn from executing in the chroot environment that is needed to install
packages when creating the network boot image.

Please advise how one can overcome this error for this and any other
apt-get post-install step that we run into.

The cuda drivers and development environment are a bit too time-consuming
to install every time we boot.

1 Attachments


Discussion

  • ralph bellofatto

    We added the following package to the computer on which we were running genimage:

    apt-get install libpam-chroot

    And then the cuda drivers succeeded in installing in the chroot environment.

    So with the new OS update on Ubuntu, this package now appears to be required.

     
    • ralph bellofatto

      It worked for a few tries, and now it is not working again...

      So something other than libpam-chroot managed to temporarily clear the
      problem. I'm not sure what that is yet...

      Ralph Bellofatto
      IBM TJ Watson Research
      1-914-945-3321
      ralphbel@us.ibm.com


  • Guang Cheng Li

    Guang Cheng Li - 2014-12-22

    It is great that you figured this out; closing this bug. We will need to document this when we add support for Nvidia GPUs.

     
  • Guang Cheng Li

    Guang Cheng Li - 2014-12-22
    • status: open --> wont-fix
    • assigned_to: Guang Cheng Li
    • component: unknown --> ubuntu
     
  • Guang Cheng Li

    Guang Cheng Li - 2014-12-24

    From Ralph Bellofatto:

    The workaround for this problem works.

    This leaves us with the problem that the chfn function did not do what it was supposed to do.

    Pursuing a real fix for this via an LTC bug report is the next step. Can you file such an LTC bug report for this issue?

     
  • Mauricio Faria de Oliveira

    The error actually happened because the chroot was mounted over NFSv4, and the NFSv4 server had incorrect domain name configuration.
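
One way to confirm that a chroot image sits on an NFSv4 mount is to look it up in the mount table. The following is a minimal sketch, not part of the original report: the helper name `fstype_for` is hypothetical, and it assumes the Linux `/proc/mounts` layout (device, mount point, filesystem type, options), where NFSv4 mounts are normally listed as `nfs4` and NFSv3 mounts as `nfs`.

```shell
# fstype_for PATH [MOUNTS_FILE]
# Print the filesystem type of the mount that contains PATH.
# Hypothetical helper; MOUNTS_FILE defaults to /proc/mounts.
fstype_for() {
    awk -v p="$1" '
        BEGIN { best = 0 }
        {
            mp = $2
            # A mount contains PATH if it is "/", equals PATH, or is a parent
            # directory of PATH. The longest matching mount point wins.
            if ((p == mp || mp == "/" || index(p, mp "/") == 1) && length(mp) >= best) {
                best = length(mp)
                fs = $3
            }
        }
        END { if (fs != "") print fs }
    ' "${2:-/proc/mounts}"
}
```

For example, `fstype_for "$installroot"` could be run on the machine that executes genimage, with `$installroot` pointing at the rootimg directory.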

    As a result, the NFSv4 idmapd could not match 'localdomain' (server) with 'cluster.com' (client), so the chfn binary (and others) appeared to be owned by nobody/nogroup. Combined with the suid bit on that binary, this caused the kernel to deny it during the PAM/audit check (the failure occurs right after the socket/sendto/recvfrom syscalls from PAM to the kernel audit).
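
The symptom described here (suid binaries inside the chroot showing up as nobody/nogroup) can be checked for directly. This is a rough sketch under stated assumptions: the function names `list_suid_owners` and `flag_unmapped` are hypothetical, GNU `find` and `stat` are assumed, and the fallback names are assumed to be `nobody` and `nogroup` as on Ubuntu.

```shell
# list_suid_owners ROOT
# Print "owner:group path" for every setuid regular file under ROOT.
list_suid_owners() {
    find "$1" -xdev -type f -perm -4000 -exec stat -c '%U:%G %n' {} +
}

# flag_unmapped
# Filter "owner:group path" lines on stdin, keeping only entries whose owner
# is nobody or whose group is nogroup.
# "|| true" keeps the exit status at 0 when nothing matches.
flag_unmapped() {
    grep -E '^(nobody:[^ ]*|[^ :]*:nogroup) ' || true
}
```

Running `list_suid_owners "$installroot" | flag_unmapped` on the NFS client would then print nothing on a correctly mapped mount, and would list binaries such as chfn when the ID mapping is broken.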

    The solution was to configure the domain name correctly on the server.

    [root@bgxcat mauricfo]# cat /etc/sysconfig/network
    NETWORKING=yes
    #HOSTNAME=bgxcat
    HOSTNAME=bgxcat.cluster.com
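
To illustrate why the short host name mattered, here is a simplified sketch of how the mismatch arises. It assumes, as the behaviour reported in this ticket suggests, that when no NFSv4 domain is configured explicitly the ID mapper takes the domain part of the host name and falls back to `localdomain` when there is none. The function `effective_domain` is hypothetical and only models that rule; it does not query the real idmapd.

```shell
# effective_domain HOSTNAME
# Simplified model (an assumption, not the real idmapd logic): use everything
# after the first dot of the host name, or "localdomain" if there is no dot.
effective_domain() {
    case "$1" in
        *.*) printf '%s\n' "${1#*.}" ;;
        *)   printf 'localdomain\n' ;;
    esac
}
```

Under this model, `effective_domain bgxcat` gives `localdomain`, while `effective_domain bgxcat.cluster.com` gives `cluster.com`, which matches the client side.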
    

    Other possible workarounds were:
    - Use NFSv3 (which has no Name-ID Mapping / idmapd)
    - Clear the suid bit
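
The two workarounds listed above might look roughly like this. Treat it as a sketch only: `clear_suid` is a hypothetical helper, and the server name and export path in the comment are placeholders rather than values from this ticket.

```shell
# Workaround 1 (shown as a comment because it needs root and a real export):
# mount the image tree over NFSv3, which does no name-based ID mapping.
#   mount -t nfs -o vers=3 nfsserver:/install /install

# Workaround 2: clear_suid FILE
# Remove the setuid bit from one binary, e.g. "$installroot/usr/bin/chfn".
clear_suid() {
    chmod u-s "$1"
}
```

Clearing the bit inside the image is a trade-off: the binary then runs only with the caller's own privileges, which is enough for package scripts running as root in the chroot but may not be for ordinary users on the booted node.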

    More details at
    https://bugs.launchpad.net/ubuntu/+source/shadow/+bug/1408589