xcat fails to install cuda drivers for netboot install
We had previously successfully built cuda drivers with the following script.
proc_dir=
sys_dir=
# exit trap to undo any mounts done earlier
function finish {
    set +e
    # quote the variables so the tests are false when nothing was mounted
    [ -n "$proc_dir" ] && umount "$proc_dir"
    [ -n "$sys_dir" ] && umount "$sys_dir"
}
trap finish EXIT
export DEBIAN_FRONTEND=noninteractive
unset DEBIAN_HAS_FRONTEND
unset DEBCONF_REDIR
unset DEBCONF_OLD_FD_BASE
unset ARCH
mount -o bind /sys $installroot/sys && sys_dir=$installroot/sys
mount -o bind /proc $installroot/proc && proc_dir=$installroot/proc
chroot $installroot \
    apt-get -q -y --force-yes -o Dpkg::Options::="--force-confold" install cuda
After we updated our source code mirrors to the latest kernel
(3.16.0-28-generic) and other security updates from Ubuntu, and also updated
the underlying OS of the machine we run the installs on, we now get
the following error:
Adding system user `nvidia-persistenced' (UID 104) ...
Adding new group `nvidia-persistenced' (GID 106) ...
Adding new user `nvidia-persistenced' (UID 104) with group `nvidia-persistenced' ...
chfn: PAM: System error
adduser: `/usr/bin/chfn -f NVIDIA Persistence Daemon nvidia-persistenced' returned error code 1. Exiting.
dpkg: error processing package nvidia-340 (--configure):
 subprocess installed post-installation script returned error exit status 1
dpkg: dependency problems prevent configuration of nvidia-340-uvm:
 nvidia-340-uvm depends on nvidia-340 (>= 340.50); however:
  Package nvidia-340 is not configured yet.
I attempted to isolate the problem by trying to just issue the chfn call
under chroot and I get the following:
/opt/xcat/share/xcat/netboot/ubuntu# chroot /install/netboot/ubuntu14.10/ppc64el/tulgpu-0000-netboot-compute/rootimg /usr/bin/chfn -f "NVIDIA Persistence Daemon" nvidia-persistenced
chfn: PAM: System error
I suspect this is the same error that the cuda post-install script is
running into. Something about the new OS and updates is now preventing
chfn from executing in the chroot environment needed to install
packages into the network boot image.
Please advise how to overcome this error for this and any other
apt-get post-install step that we run into.
The cuda drivers and development environment are too time-consuming
to install every time we boot.
We added the following package to the machine we were running genimage on:
apt-get install libpam-chroot
And then the cuda drivers succeeded in installing in the chroot environment.
So with the new os update on ubuntu this package appears to now be required.
It worked for a few tries, and now it is not working again...
So something other than libpam-chroot managed to temporarily clear the
problem. I'm not sure what that is yet...
Ralph Bellofatto
IBM TJ Watson Research
1-914-945-3321
ralphbel@us.ibm.com
From: "ralph bellofatto" ralphbel@users.sf.net
To: "[xcat:bugs] " 4486@bugs.xcat.p.re.sf.net
Date: 12/19/2014 05:30 PM
Subject: [xcat:bugs] #4486 xcat fails to install cuda drivers for
netboot install after os update
[bugs:#4486] xcat fails to install cuda drivers for netboot install after
os update
Status: open
Milestone: 2.9.1
Created: Thu Dec 18, 2014 08:31 PM UTC by ralph bellofatto
Last Updated: Thu Dec 18, 2014 08:31 PM UTC
Owner: nobody
It is great that you figured this out, closing this bug. We will need to document this when we add the support for Nvidia GPU.
If you search Google for the error "chfn: PAM: System error", you will find many discussions of it in Docker configurations. The xCAT diskless genimage procedure is similar to Docker in that it runs an OS in a chroot environment. According to http://stackoverflow.com/questions/25193161/chfn-pam-system-error-intermittently-in-docker-hub-builds, the problem seems to be caused by some recent kernel updates, and someone recommended the workaround "ln -s -f /bin/true /usr/bin/chfn". Could you give it a try?
BTW, several bugs have already been opened against Debian/Ubuntu for this problem, for example:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=763391
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=745082
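For anyone who wants to try the /bin/true workaround against a genimage root without losing the real binary, here is a minimal sketch. The helper names stub_chfn and restore_chfn are made up for illustration, and the sketch assumes the image root is passed as the only argument (for example $installroot from the install script in this report).

```shell
#!/bin/sh
# Sketch only: stub_chfn and restore_chfn are hypothetical helper names.
# Both take the image root directory as their single argument.

# Move the real chfn aside and point /usr/bin/chfn at /bin/true.
# The link target is absolute, so it resolves inside the chroot.
stub_chfn() {
    root=$1
    if [ -e "$root/usr/bin/chfn" ] && [ ! -L "$root/usr/bin/chfn" ]; then
        mv "$root/usr/bin/chfn" "$root/usr/bin/chfn.real"
    fi
    ln -s -f /bin/true "$root/usr/bin/chfn"
}

# Put the real chfn back once the package installation has finished.
restore_chfn() {
    root=$1
    if [ -e "$root/usr/bin/chfn.real" ]; then
        mv -f "$root/usr/bin/chfn.real" "$root/usr/bin/chfn"
    fi
}
```

One possible use is to call stub_chfn "$installroot" before the chroot apt-get step and restore_chfn "$installroot" afterwards. While the stub is in place, the chfn call made by adduser succeeds without doing anything, so the full-name field of the new account is not set.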
From Ralph Bellofatto:
The workaround for this problem works.
However, this leaves us with the problem that chfn did not do what it was supposed to do.
Pursuing a real fix for this via an LTC bug report is the next step. Can you file an LTC bug report for this issue?
The error actually happened because the chroot was mounted over NFSv4 and the NFSv4 server had an incorrect domain name configuration.
As a result, the NFSv4 idmapd did not match 'localdomain' (server) with cluster.com (client), so the chfn binary (and others) appeared to be owned by nobody/nogroup. Combined with the suid bit on that binary, this led the kernel to deny it during the PAM/audit check (the failure occurs right after the socket/sendto/recvfrom syscalls from PAM to kernel audit).
The solution was to configure the domain name correctly on the server.
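A quick way to check for this kind of mismatch is to compare the Domain setting in the idmapd configuration on both sides. The sketch below assumes the setting lives in an idmapd.conf-style file (normally /etc/idmapd.conf) and that a copy of the server's file is available on the machine doing the comparison. The helper names idmapd_domain and same_idmap_domain are hypothetical.

```shell
#!/bin/sh
# Sketch only: idmapd_domain and same_idmap_domain are hypothetical helpers.

# Print the value of the first uncommented "Domain = ..." line in an
# idmapd.conf-style file; prints nothing if no such line exists.
idmapd_domain() {
    awk -F= '/^[ \t]*Domain[ \t]*=/ {
        gsub(/[ \t]/, "", $2); print $2; exit
    }' "$1"
}

# Succeed only when both files set the same, non-empty Domain.
same_idmap_domain() {
    a=$(idmapd_domain "$1")
    b=$(idmapd_domain "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

With the values from this report, the server file would yield localdomain and the client file cluster.com, so same_idmap_domain would fail. An empty result only means the file does not set Domain explicitly.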
Other possible workarounds were:
- Use NFSv3 (which has no Name-ID Mapping / idmapd)
- Clear the suid bit
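To spot the symptom in an image root and to apply the second workaround, something like the following could be used. It is only a sketch: the helper names list_unmapped_suid and clear_suid are hypothetical, and it assumes nobody/nogroup map to ID 65534, which is the Debian/Ubuntu convention.

```shell
#!/bin/sh
# Sketch only: list_unmapped_suid and clear_suid are hypothetical helpers.
# ASSUMPTION: 65534 is the nobody/nogroup ID (the Debian/Ubuntu convention).

# List setuid files under a directory that show up as owned by 65534.
list_unmapped_suid() {
    find "$1" -xdev -type f -perm -4000 \( -uid 65534 -o -gid 65534 \)
}

# Clear the setuid bit on a single file.
clear_suid() {
    chmod u-s "$1"
}
```

If list_unmapped_suid prints usr/bin/chfn (or other setuid binaries) for the NFSv4-mounted root, the ID mapping is the likely cause. clear_suid on that file is a stopgap, not a fix for the mapping itself.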
More details on
https://bugs.launchpad.net/ubuntu/+source/shadow/+bug/1408589