From: <ha...@no...> - 2002-05-09 10:25:30
Hi,

I investigated problems with autofs on a bproc node, and I believe there
is a deadlock caused by interference between the way bproc handles
process groups and the way autofs uses them.

Here is a minimal way to invoke the problem:

  modprobe --node 1 autofs
  bpsh 1 mkdir -p /usr/lib/autofs
  for x in /usr/lib/autofs/*; do bpcp $x 1:$x; done
  bpsh 1 touch /etc/yyy
  bpsh 1 automount /xxx file /etc/yyy
  bpsh 1 ps|grep 'automount'
  ...
  22113 ?        00:00:00 automount
  bpsh 1 strace -p 22113

And in another window:

  bpsh 1 ls /xxx/zzz

If you also have this problem, ls hangs, only kill -9 kills automount,
and strace prints:

  read(4, "\3\0\0\0\0\0\0\0\2\0\0\0\3\0\0\0zzz\0\0\0\0\0\0\0\0\0\0"..., 272) = 272
  chdir("/xxx")                           = 0
  lstat64("zzz",

With a lot of guesswork (as I am no expert in this) I deduced this
scenario:

1. The automount process finds out its own process group using
   getpgrp().

2. It mounts itself, using the mount executable with the pgrp option set
   to the result of getpgrp().

3. The kernel performs the mount and stores pgrp as oz_pgrp: processes
   with this 'magic' pgrp can see raw directories instead of automounted
   ones.

4. When an automounted subdirectory is accessed, the kernel writes to the
   pipe to automount, asking it to mount what should be seen there.

5. automount tests whether the automounted subdirectory exists; as it
   belongs to oz_pgrp, it should see raw directories.

6. The kernel fails to recognize that automount belongs to oz_pgrp, and
   instead of showing raw directories, it wants to ask automount itself
   to deliver what should be seen. Deadlock.

I believe the problem is that getpgrp() on a bproc node returns something
other than the current->pgrp tested inside the kernel on the bproc node
in autofs_oz_mode().

In more detail, the scenario looks like this:

1. in autofs-4.0.0pre10/daemon/automount.c:

     /* Make our own process group for "magic" reason: processes that
        share our pgrp see the raw filesystem behind the magic.  So if
        we are a submount, don't change -- otherwise we won't be able
        to actually perform the mount.
        A pgrp is also useful for controlling all the child processes
        we generate. */
     if ( !submount && setpgrp() ) {
       syslog(LOG_CRIT, "setpgrp: %m");
       exit(1);
     }
     my_pgrp = getpgrp();

   (I am not sure whether setpgrp() is called or not, but I think it
   might not matter now. It may, however, cause more problems beyond the
   one described here.)

2. still in autofs-4.0.0pre10/daemon/automount.c, in mount_autofs(...):

     sprintf(options, "fd=%d,pgrp=%u,minproto=2,maxproto=%d",
             pipefd[1], (unsigned)my_pgrp, AUTOFS_MAX_PROTO_VERSION);
     sprintf(our_name, "automount(pid%u)", (unsigned)my_pid);

     if (spawnl(LOG_DEBUG, PATH_MOUNT, PATH_MOUNT, "-t", "autofs", "-o",
                options, our_name, path, NULL) != 0) {
       syslog(LOG_CRIT, "cannot find autofs in kernel");

3. in kernel in linux/fs/autofs/inode.c, autofs_read_super():

     if ( parse_options(data,&pipefd,&root_inode->i_uid,&root_inode->i_gid,
                        &sbi->oz_pgrp,&minproto,&maxproto) ) {
       printk("autofs: called with bogus options\n");
       goto fail_dput;
     }

   (the "magic" group ends up in oz_pgrp)

4. in kernel, in the system call invoked by ls: the kernel writes to the
   pipe to the automount process, asking it to arrange things in
   /xxx/zzz. The system call does not return to ls until automount does
   its work (and therefore never returns).

5. in autofs-4.0.0pre10/daemon/automount.c:

     static int handle_packet_missing(...)
     ...
     chdir(ap.path);
     if ( lstat(pkt->name,&st) == -1 ||
          (S_ISDIR(st.st_mode) && st.st_dev == ap.dev) ) {
       /* Need to mount or symlink */

   (lstat() should see raw directories, but it will not, and will hang)

6. in kernel in fs/autofs/autofs_i.h:

     /* autofs_oz_mode(): do we see the man behind the curtain?  (The
        processes which do manipulations for us in user space sees the
        raw filesystem without "magic".) */

     static inline int autofs_oz_mode(struct autofs_sb_info *sbi) {
       return sbi->catatonic || current->pgrp == sbi->oz_pgrp;
     }

   (automount should be recognized as "magic" with autofs_oz_mode()==1
   but instead gets the same treatment as ls in step 4 above.
   There is probably a kernel lock around autofs operations, or maybe
   the kernel even sends a request to autofs via the pipe; in any case it
   has to deadlock, as automount is still waiting for lstat())

*******************************

I may be wrong with my analysis; I am no expert on any of the things
involved (bproc, autofs). Please correct me if I am wrong.

If I am right, there are several possible ways to avoid the deadlock:

- Make a modified autofs.o which is aware of bproc-related process group
  tricks: autofs_oz_mode() should test for the same value that is
  returned by getpgrp() (I hope this can avoid node-head-node
  communication). We should also verify whether the setpgrp() used in
  automount.c would work as expected.

- Start automount outside the distributed PID space. I am not sure how to
  do this, bproc is damn good at not letting you escape :-) Could we
  modify /etc/inittab on the node, signal the init process, and have our
  automount process created this way?

Any opinions and suggestions are more than welcome, especially comments
on the current handling of process groups in bproc (I know nothing about
it).

Thanks and Best Regards

Vaclav Hanzl

========= copy of my original message on beowulf maillist: ========

Subject: autofs mount on bproc node?
From: hanzl
To: be...@be...
Date: Wed, 08 May 2002 17:18:39 +0200

Hi,

have any of you great gurus managed to use autofs on bproc nodes? I am
pushing hard, but at this moment it looks like hitting a concrete wall
with my head... any help would be more than welcome.

My nodes already run NFS client, NFS server and syslogd/klogd. Automount
seems to start OK, but when I ls an automounted directory (node is
client, head is server), ls hangs, the automount process hangs (kill -9
is needed to kill it) and there is no error message anywhere. I have
syslogd working on the node and automount is full of syslog() calls, but
in this case it says nothing; it probably hangs early when receiving the
automount request from the kernel.
The same automount setup works when head is client and node is server.

(Is there any way to force a compiled kernel to give out more debug
messages? E.g. write somewhere to /proc?)

Running daemons on bproc nodes is tricky and probably worth a mini-howto.
(Yes, I try to avoid daemons on nodes, but sometimes I really need them.)
If you know any related documents, please let me know.

Thanks

Vaclav

-------------------------------------------------------

My setup:

RedHat 7.2 with most rpm updates till Nov 2001, Clustermatic (March 2002
version), kernel 2.4.18-lanl.16, automount version 3.1.7 (also tested
4.0.0).

==> /etc/auto.master <==
# /etc/auto.master
/nfs    /etc/auto.nfs   rw,intr,rsize=8192,wsize=8192

==> /etc/auto.nfs <==
# /etc/auto.nfs
*       -fstype=autofs,-Dhost=&        file:/etc/auto.sub

==> /etc/auto.sub <==
# /etc/auto.sub
*       ${host}:/&

==> /etc/beowulf/syslog.conf <==
# syslog.conf for magi nodes
# log everything on screen:
*.*     /dev/console
# log everything to head (magi):
*.*     @10.0.4.1

==> /etc/beowulf/exports.node <==
# exports for beowulf node (experimental)
/etc 10.0.4.1(ro)

==> /etc/beowulf/nsswitch.conf <==
passwd: bproc files
hosts:  bproc

==> /etc/exports <==
# magi exports
/bin 10.0.4.0/255.255.255.0(ro)
/home noel(rw) 10.0.4.0/255.255.255.0(ro)
/lib 10.0.4.0/255.255.255.0(ro)
/sbin 10.0.4.0/255.255.255.0(ro)
/usr 10.0.4.0/255.255.255.0(ro)
/var 10.0.4.0/255.255.255.0(ro)

==> /etc/sysconfig/syslog <==
# Options to syslogd
# -m 0 disables 'MARK' messages.
# -r enables logging from remote machines
# -x disables DNS lookups on messages recieved with -r
# See syslogd(8) for more details
#SYSLOGD_OPTIONS="-m 0"
# VH: added -r for log from magi nodes to magi master:
SYSLOGD_OPTIONS="-m 0 -r"
# Options to klogd
# -2 prints all kernel oops messages twice; once for klogd to decode, and

== And my main experimental script: ==

#!/bin/bash
# beowulf node startup for magi
# New (RH7.2_Clustermatic/magi) version

echo This is /nfs/noel/home/hanzl/beowulf/startnode

########### CONFIG AREA ############

#nodenum:
N=1

#master IP:
MASTER=10.0.4.1

# directories to NFS-replicate from master (read only):
NFSDIRS='bin sbin usr'

# directories to just create on node:
CREATEDIRS='tmp nfs .autofsck var/lib/nfs/sm var/run var/lock/subsys var/nis'

# files to copy from master to node with identical pathname:
COPYFILES='/etc/auto.master /etc/auto.nfs /etc/auto.sub /etc/services /etc/rpc /etc/protocols /etc/passwd /etc/group'

# modules needed on nodes:
MODULES='sunrpc lockd nfs nfsd autofs'

####################################

echo Master IP: $MASTER, node number: $N
bpstat $N

echo Rebooting node $N...
bpctl -S $N -s reboot
# poll the node status once a second until it reports "up"
while true; do
  STATUS=`bpstat $N -s`
  echo -n ' '$STATUS
  if [[ "$STATUS" == "up" ]]; then break; fi
  sleep 1
done
echo ''
echo Node $N is up!

####################################

echo "Creating any missing directories on node $N:"
for d in $NFSDIRS $CREATEDIRS; do
  echo -n ' '$d
  bpsh $N mkdir -p /$d
done
echo ''
echo ... done

####################################

echo "Inserting modules on node $N"
for m in $MODULES; do
  echo -n ' '$m
  modprobe --node $N $m
done
echo ''
echo ... done

####################################

echo "Copying config files to node $N"
echo "Files with identical pathname:"
for f in $COPYFILES; do
  echo -n ' '$f
  bpcp $f $N:$f
done
echo ''
echo ... done

echo "Files with different pathname:"
bpcp /etc/beowulf/syslog.conf $N:/etc/syslog.conf
bpcp /etc/beowulf/exports.node $N:/etc/exports
bpcp /etc/beowulf/nsswitch.conf $N:/etc/nsswitch.conf
echo ... done

####################################

####################################

echo "Hello to console"|bpsh $N dd of=/dev/console 2>/dev/null

####################################

echo "Starting basic daemons on node $N"

bpsh $N touch /var/lib/nfs/rmtab
bpsh $N touch /var/lib/nfs/xtab
bpsh $N touch /var/lib/nfs/etab

echo "portmap..."
bpsh $N portmap

echo "syslogd..."
bpsh $N syslogd

echo "klogd..."
bpsh $N klogd

bpsh $N logger "Starting basic rpc daemons on node $N"

echo "rpc.statd..."
# statd will run in /var/lib/nfs/statd as rpcuser and needs rw access
bpsh $N mkdir -p /var/lib/nfs/statd
bpsh $N chown rpcuser /var/lib/nfs/statd
bpsh $N chgrp rpcuser /var/lib/nfs/statd
bpsh $N rpc.statd
bpsh $N touch /var/lock/subsys/nfslock

echo "Expected rpcinfo:" nlockmgr portmapper status
echo " Actual rpcinfo:" `bpsh $N rpcinfo -p|awk '{print $5}'|sort|uniq`
echo "(maybe they did not start yet...)"
# 'nlockmgr' is provided by kernel module, not daemon

####################################

# (This could have been done even just after portmap, but is probably safer here)
echo "Mounting NFS directories, node $N is client, head is server"
bpsh $N logger "Mounting NFS directories, node $N is client, head is server"
for d in $NFSDIRS; do
  echo -n ' '$d
  # use the configured node number and master IP instead of hard-coded values
  bpsh $N mount $MASTER:/$d /$d
done
echo ''
echo ... done

####################################

echo "Expected rpcinfo:" nlockmgr portmapper status
echo " Actual rpcinfo:" `bpsh $N rpcinfo -p|awk '{print $5}'|sort|uniq`

echo "Starting NFS server daemons on node $N"
bpsh $N logger "Starting NFS server daemons on node $N"
# We should provide: mountd nfs rquotad

RPCNFSDCOUNT=8

echo 'exportfs (not daemon)...'
bpsh $N exportfs -r

## echo "rpc.rquotad"
## bpsh $N rpc.rquotad
## # hangs (but shows in rpcinfo), avoid it

echo "rpc.mountd..."
# Special treatment needed: rpc.mountd with socket on stdin goes mad
# and therefore cannot be started by bpsh directly. Stdin redirect
# on node helps. You can either NFS-mount /usr/sbin or you can use
# new bproc, which delivers missing executables when absolute path
# is used.
bpsh $N sh -c '/usr/sbin/rpc.mountd </dev/null'

echo "rpc.nfsd (count=$RPCNFSDCOUNT)..."
bpsh $N rpc.nfsd $RPCNFSDCOUNT

bpsh $N touch /var/lock/subsys/nfs

echo "NFS server on node $N should work now"

####################################

echo "Expected rpcinfo:" mountd nfs nlockmgr portmapper status
echo " Actual rpcinfo:" `bpsh $N rpcinfo -p|awk '{print $5}'|sort|uniq`

echo "Testing automount (head is client, node $N is server):"
echo "ls /nfs/n$N/etc:"
ls /nfs/n$N/etc

####################################

echo "Processes on node $N seen in head PID-space:"
ps aux|bpstat -P|grep ^$N'[^0-9]'

####################################

# Stop here: the autofs client part below hangs (see above).
exit

echo "Starting autofs client on node $N"
bpsh $N logger "Starting autofs client on node $N"

bpsh $N /usr/sbin/automount /nfs file /etc/auto.nfs rw,intr,rsize=8192,wsize=8192
bpsh $N touch /var/lock/subsys/autofs

#bpsh $N sh -c '/usr/sbin/automount /nfs file /etc/auto.nfs rw,intr,rsize=8192,wsize=8192'
#bpsh $N sh -c '/root/autofs-4.0.0pre10/daemon/automount /nfs file /etc/auto.nfs rw,intr,rsize=8192,wsize=8192'

# bpsh $N ls /nfs/n-1/var will HANG !!! :-(

==================== END =====================
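
One way to check the process-group hypothesis from the analysis above
would be to compare, on the node, the pgrp= value that automount passed
to mount with the process group the kernel reports for the automount
process. The script below is only a sketch under assumptions that
nothing in this message confirms: that /etc/mtab on the node records the
autofs mount together with its options (including pgrp=), and that
/proc/<pid>/stat on a bproc node shows the kernel's own current->pgrp
rather than the value getpgrp() returns. The helper names mount_pgrp and
stat_pgrp and the script name are made up for illustration.

```shell
#!/bin/sh
# Hypothetical diagnostic sketch; not part of autofs or bproc.
# Usage (on the node): check_pgrp.sh MOUNTPOINT AUTOMOUNT_PID

# Print the pgrp= value found in one mount-table line, e.g.
# "automount(pid22113) /xxx autofs rw,fd=5,pgrp=22113,minproto=2,maxproto=4 0 0"
mount_pgrp() {
    printf '%s\n' "$1" | sed -n 's/.*[ ,]pgrp=\([0-9][0-9]*\).*/\1/p'
}

# Print the process-group field of one /proc/<pid>/stat line.
# The command name in parentheses may contain spaces, so everything up to
# the last ") " is dropped first; the fields that follow are state, ppid, pgrp.
# The body runs in a subshell so that "set" does not leak to the caller.
stat_pgrp() (
    rest=${1##*\) }
    set -f              # no filename expansion while splitting into fields
    set -- $rest
    printf '%s\n' "$3"
)

if [ "$#" -eq 2 ]; then
    mnt=$1
    pid=$2
    # ASSUMPTION: /etc/mtab lists the autofs mount with its options.
    line=$(awk -v m="$mnt" '$2 == m && $3 == "autofs"' /etc/mtab)
    asked=$(mount_pgrp "$line")
    # ASSUMPTION: /proc/PID/stat shows the kernel-side process group.
    seen=$(stat_pgrp "$(cat "/proc/$pid/stat")")
    echo "pgrp given to mount:     ${asked:-unknown}"
    echo "pgrp in /proc/$pid/stat: ${seen:-unknown}"
    if [ -n "$asked" ] && [ "$asked" = "$seen" ]; then
        echo "process groups match"
    else
        echo "process groups differ (or could not be read)"
    fi
fi
```

Run on the node with the mount point and the automount PID as arguments
(/xxx and 22113 in the example above). If the two numbers differ, that
would be consistent with autofs_oz_mode() failing to recognize automount;
if they match, the cause is probably elsewhere.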