From: Nicholas H. <he...@se...> - 2003-08-03 07:58:15
On Sun, 2003-08-03 at 03:13, Nicholas Henke wrote:
> And the code where it bombs looks like:
>
>   assert(pthread_getattr_func != NULL,
>          "floating stack pthread must have pthread_getattr_np");
>   pthread_attr_t attr;
>   pthread_t tid = pthread_self();
>   int rslt = ((pthread_getattr_func_type)pthread_getattr_func)(tid, &attr);
>   if (rslt != 0) {
>     /* what else can we do?? */
>     fatal("Can not locate current stack region!");
>   }

Just a followup: it looks like rslt == 3 here, which, if I am reading
things correctly, corresponds to the errno:

  #define ESRCH  3  /* No such process */

I am inlining some of the code for pthread_getattr_np
(glibc-linuxthreads/attr.c); there are a few lines that access the pid
of the thread for some scheduling stuff. Maybe the pid is not getting
masked correctly, or something along those lines?

  int pthread_getattr_np (pthread_t thread, pthread_attr_t *attr)
  {
    pthread_handle handle = thread_handle (thread);
    pthread_descr descr;

    if (handle == NULL)
      return ENOENT;
    descr = handle->h_descr;

    attr->__detachstate = (descr->p_detached
                           ? PTHREAD_CREATE_DETACHED
                           : PTHREAD_CREATE_JOINABLE);

    attr->__schedpolicy = __sched_getscheduler (descr->p_pid);
    if (attr->__schedpolicy == -1)
      return errno;
    if (__sched_getparam (descr->p_pid,
                          (struct sched_param *) &attr->__schedparam) != 0)
      return errno;

The rest of the code does not return anything besides 0 or 1, so I am
guessing it is the above actions that barf it. Much confusion ;)

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
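A standalone reproduction of the same call path, outside the JVM, can
help confirm whether pthread_getattr_np itself misbehaves under BProc's
pid handling. This sketch is not code from the thread; it assumes a
glibc that exports pthread_getattr_np (hence _GNU_SOURCE) and must be
compiled with -lpthread:

  /* Sketch: reproduce HotSpot's stack-region lookup outside the JVM.
     Assumes glibc's pthread_getattr_np; not code from this thread. */
  #define _GNU_SOURCE
  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  int main(void)
  {
      pthread_attr_t attr;
      void *stack_addr;
      size_t stack_size;

      int rslt = pthread_getattr_np(pthread_self(), &attr);
      if (rslt != 0) {
          /* On the failing nodes this reportedly comes back 3 (ESRCH). */
          fprintf(stderr, "pthread_getattr_np: %s (%d)\n",
                  strerror(rslt), rslt);
          exit(1);
      }
      pthread_attr_getstack(&attr, &stack_addr, &stack_size);
      printf("stack base %p, size %lu\n", stack_addr,
             (unsigned long)stack_size);
      pthread_attr_destroy(&attr);
      return 0;
  }

Running this under bpsh on a failing node versus locally on the master
would show whether the error is specific to the migrated environment.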
From: Nicholas H. <he...@se...> - 2003-08-03 07:13:34
Ok -- let's see if this sheds some light. After managing to get Sun's
java compiled on Linux, I can now get some decent core dumps & stack
traces from the lil bugger. (BTW, ld -z defs really blows...) It
appears to be some pthread interaction, as the thread cannot locate
the current stack region. If any more information is needed, I have
core files and can reproduce this at will.

Here is the java error:

  # HotSpot Virtual Machine Error, Internal Error
  # Please report this error at
  # http://java.sun.com/cgi-bin/bugreport.cgi
  #
  # Java VM: Java HotSpot(TM) Client VM (1.4.1-internal-henken_03_aug_2003_01_28-debug mixed mode)
  #
  # Fatal: Can not locate current stack region!
  #
  # Error ID: /scratch/user/henken/hotspot/src/os/linux/vm/os_linux.cpp, 727
  #
  # Problematic Thread: [error occured during error reporting] prio=1079672305 tid=0x0x80988a8 nid=0x4a6c runnable
  # Current thread is 0x402
  Dumping core ...

Ok, and the mighty gdb shows us:

  #3  0x40509bdf in os::abort (dump_core=1)
      at /scratch/user/henken/hotspot/src/os/linux/vm/os_linux.cpp:1018
  #4  0x403f4ad6 in report_error (is_vm_internal_error=1,
      file_name=0x407b3940 "/scratch/user/henken/hotspot/src/os/linux/vm/os_linux.cpp",
      line_no=727, title=0x406f6f4d "Internal Error", format=0x406f6f43 "Fatal: %s")
      at /scratch/user/henken/hotspot/src/share/vm/utilities/debug.cpp:343
  #5  0x403f432b in report_fatal (
      file_name=0x407b3940 "/scratch/user/henken/hotspot/src/os/linux/vm/os_linux.cpp",
      line_no=727, format=0x407b3900 "Can not locate current stack region!")
      at /scratch/user/henken/hotspot/src/share/vm/utilities/debug.cpp:139
  #6  0x405092f0 in current_stack_region (bottom=0x4abe0a40, size=0x4abe0a3c)
      at /scratch/user/henken/hotspot/src/os/linux/vm/os_linux.cpp:727
  #7  0x4050966a in os::current_stack_base ()
      at /scratch/user/henken/hotspot/src/os/linux/vm/os_linux.cpp:871
  #8  0x40560a67 in Thread::record_stack_base_and_size (this=0x80988a8)
      at /scratch/user/henken/hotspot/src/share/vm/runtime/thread.cpp:89
  #9  0x4059c8a7 in VMThread::run (this=0x80988a8)
      at /scratch/user/henken/hotspot/src/share/vm/runtime/vmThread.cpp:165
  #10 0x40508b46 in _start (thread=0x80988a8)
      at /scratch/user/henken/hotspot/src/os/linux/vm/os_linux.cpp:406
  #11 0x40021c6f in pthread_start_thread (arg=0x4abe0be0) at manager.c:284

And the code where it bombs looks like:

  assert(pthread_getattr_func != NULL,
         "floating stack pthread must have pthread_getattr_np");
  pthread_attr_t attr;
  pthread_t tid = pthread_self();
  int rslt = ((pthread_getattr_func_type)pthread_getattr_func)(tid, &attr);
  if (rslt != 0) {
    /* what else can we do?? */
    fatal("Can not locate current stack region!");
  }

I have NO idea what any of this means, but I would love to do some more
debugging on it.

--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
From: <er...@he...> - 2003-08-01 15:31:07
On Fri, Aug 01, 2003 at 01:14:58AM +0200, J.A. Magallon wrote:
> On 07.31, er...@he... wrote:
> > On Wed, Jul 30, 2003 at 10:55:12PM -0400, Nicholas Henke wrote:
> > > On Tue, 2003-07-01 at 11:05, er...@he... wrote:
> > > >
> > > > I believe clone works. Most of the interesting stuff with clone
> > > > is local to the node and BProc doesn't get involved at all. So,
> > > > in theory, it should be possible to make Java work.
>
> Sorry for the late answer...
>
> I am running pthread programs on nodes with bpsh. It works fine with
> standard kernels and bproc (not so fine with kernels that have the
> O(1) scheduler, but that's another story...). They are not very
> pthread intensive: they just spawn at the beginning, do heavy render
> work, and a bit of synchronization and per-thread variables.
>
> If you suspect problems in your kernel, you can try this:
>
>   http://giga.cps.unizar.es/~magallon/linux/kernel/2.4.22-pre9-jam1m.tar.gz
>
> (patches on top of 2.4.22-pre9)
>
> Some ideas. AFAIK, java threads are built upon POSIX pthreads, not
> clone directly. Pthreads themselves are built upon the clone()
> syscall. And here comes the funny part. Attached is a simple example
> of spawning a program over the cluster nodes with bproc, and on each
> node spawning tasks with the glibc clone() call. If you build and
> link with -lbproc, it works. If you link with -lbproc -lpthread (even
> if you don't use any pthread call), it does not work. So the
> conclusion is that libpthread is overriding the clone() glibc
> syscall, and breaks it.

Yeah, pthreads gets its dirty little fingers in a lot of places. I
believe it mostly just tries to clean up with its wrappers around
fork, clone, etc.

Another unfortunate side effect of the pthreads implementation is that
it does not recognize bproc_rfork, etc. as fork-like calls, so the
child process is sometimes hosed if those are called from a pthreaded
program. The whole mess is severely flawed, IMO.

- Erik

P.S. I'm talking about LinuxThreads here, not NPTL. I haven't looked
at NPTL yet, but it should be much less screwy.
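A quick way to see those fork wrappers in action is pthread_atfork():
LinuxThreads runs the registered handlers around fork(), but a call
like bproc_rfork() is invisible to it, so none of them fire -- which is
the "hosed child" failure mode Erik describes. A minimal illustration
(ordinary fork() only; the bproc call is just noted in a comment):

  /* Demonstrates LinuxThreads' fork() hooks. Compile with -lpthread.
     A bproc_rfork() would bypass these handlers entirely. */
  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/wait.h>
  #include <unistd.h>

  static void prepare(void) { puts("atfork: prepare"); }
  static void parent(void)  { puts("atfork: parent"); }
  static void child(void)   { puts("atfork: child"); }

  int main(void)
  {
      pthread_atfork(prepare, parent, child);

      pid_t pid = fork();        /* all three handlers run */
      if (pid < 0) { perror("fork"); return 1; }
      if (pid == 0)
          _exit(0);
      waitpid(pid, NULL, 0);

      /* A bproc_rfork(node) here would run no handlers at all,
         leaving pthread-internal state unguarded in the child. */
      return 0;
  }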
From: J.A. M. <jam...@ab...> - 2003-07-31 23:15:10
On 07.31, er...@he... wrote:
> On Wed, Jul 30, 2003 at 10:55:12PM -0400, Nicholas Henke wrote:
> > On Tue, 2003-07-01 at 11:05, er...@he... wrote:
> > >
> > > I believe clone works. Most of the interesting stuff with clone is
> > > local to the node and BProc doesn't get involved at all. So, in
> > > theory, it should be possible to make Java work.

Sorry for the late answer...

I am running pthread programs on nodes with bpsh. It works fine with
standard kernels and bproc (not so fine with kernels that have the O(1)
scheduler, but that's another story...). They are not very pthread
intensive: they just spawn at the beginning, do heavy render work, and
a bit of synchronization and per-thread variables.

If you suspect problems in your kernel, you can try this:

  http://giga.cps.unizar.es/~magallon/linux/kernel/2.4.22-pre9-jam1m.tar.gz

(patches on top of 2.4.22-pre9)

Some ideas. AFAIK, java threads are built upon POSIX pthreads, not
clone directly. Pthreads themselves are built upon the clone() syscall.
And here comes the funny part. Attached is a simple example of spawning
a program over the cluster nodes with bproc, and on each node spawning
tasks with the glibc clone() call. If you build and link with -lbproc,
it works. If you link with -lbproc -lpthread (even if you don't use any
pthread call), it does not work. So the conclusion is that libpthread
is overriding the clone() glibc syscall, and breaks it.

Attached are bcl.c and Makefile; see the LIBS variable in the Makefile.

--
J.A. Magallon <jam...@ab...>          \  Software is like sex:
werewolf.able.es                          \  It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.22-pre9-jam1m (gcc 3.3.1 (Mandrake Linux 9.2 3.3.1-0.7mdk))
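The bcl.c attachment itself is not preserved in the archive. A minimal
spawner in the spirit described -- raw glibc clone() for the node-local
tasks -- might look like the hypothetical sketch below (a
reconstruction, not the original file; the stack handling and flags are
illustrative). Per the report, linking it with -lpthread is what breaks
the clone() call:

  /* Hypothetical sketch in the spirit of bcl.c (the real attachment
     is not in the archive): spawn a task with the raw glibc clone()
     call. Reportedly works linked with -lbproc alone, breaks when
     -lpthread is added. */
  #define _GNU_SOURCE
  #include <sched.h>
  #include <signal.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/wait.h>
  #include <unistd.h>

  #define STACK_SIZE (64 * 1024)

  static int task(void *arg)
  {
      printf("child %d running as pid %d\n", *(int *)arg, getpid());
      return 0;
  }

  int main(void)
  {
      int id = 0;
      char *stack = malloc(STACK_SIZE);
      if (!stack) { perror("malloc"); return 1; }

      /* clone() takes the *top* of the child stack on x86
         (stacks grow down); SIGCHLD lets waitpid() reap it. */
      pid_t pid = clone(task, stack + STACK_SIZE, SIGCHLD, &id);
      if (pid < 0) { perror("clone"); return 1; }

      waitpid(pid, NULL, 0);
      free(stack);
      return 0;
  }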
From: Nicholas H. <he...@se...> - 2003-07-31 16:55:27
On Thu, 2003-07-31 at 12:33, er...@he... wrote:
>
>   bpsh X strace blah
>
> That way bpsh doesn't get caught in the strace.
>
> Oh, and SYS_223 is BProc. I wouldn't bother trying to figure out what
> those calls are since it's bpsh doing them and not Java. It's going
> to be some node info stuff and then a call to bproc_vexecmove_io.

*sigh* That is what I get for trying to debug something while dead
tired. I will forward the correct strace soon.

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
From: <er...@he...> - 2003-07-31 16:50:11
On Wed, Jul 30, 2003 at 10:55:12PM -0400, Nicholas Henke wrote:
> On Tue, 2003-07-01 at 11:05, er...@he... wrote:
> >
> > I believe clone works. Most of the interesting stuff with clone is
> > local to the node and BProc doesn't get involved at all. So, in
> > theory, it should be possible to make Java work.
>
> Ok -- finally getting around to tracking this down again.
>
> > In terms of what needs to be done, that depends entirely on what
> > you're trying to run. I've done some simple pthreads things on
> > nodes w/o problems. The first place to look is probably strace
> > output of a program that fails. Then try and figure out how what
> > the app is seeing differs from what it's expecting.
>
> Ok -- attached is the end of the strace from 'strace -f bpsh 7
> /usr/java/j2sdk1.4.1_01/bin/java -version'. There are some calls to
> SYS_223, which I am guessing is the bproc syscall. I am going to
> attempt to decipher those calls a bit, but I would appreciate any
> help you may have on this one.

I think you're going to have some difficulty getting good/clean
straces if you're doing strace -f with bpsh. The -f flag doesn't catch
the children of bpsh since strace doesn't recognize bproc_vexecmove_io
as a variant of fork() (which it is).

The "-f" flag is racy in general. The child process will likely run a
bit before strace gets a chance to attach to it. This is presuming
that strace isn't quite clever enough to force CLONE_PTRACE yet. I'm
not sure if it is. It might be.

strace might be problematic for Java in general since it might call
clone a lot on its own.

Anyway, the point of this rambling story is that when I need to strace
something on a slave node, I usually end up copying the binary out to
the slave node and doing something like:

  bpsh X strace blah

That way bpsh doesn't get caught in the strace.

Oh, and SYS_223 is BProc. I wouldn't bother trying to figure out what
those calls are since it's bpsh doing them and not Java. It's going to
be some node info stuff and then a call to bproc_vexecmove_io.

- Erik
From: Nicholas H. <he...@se...> - 2003-07-31 02:55:30
On Tue, 2003-07-01 at 11:05, er...@he... wrote:
>
> I believe clone works. Most of the interesting stuff with clone is
> local to the node and BProc doesn't get involved at all. So, in
> theory, it should be possible to make Java work.

Ok -- finally getting around to tracking this down again.

> In terms of what needs to be done, that depends entirely on what
> you're trying to run. I've done some simple pthreads things on nodes
> w/o problems. The first place to look is probably strace output of a
> program that fails. Then try and figure out how what the app is
> seeing differs from what it's expecting.

Ok -- attached is the end of the strace from 'strace -f bpsh 7
/usr/java/j2sdk1.4.1_01/bin/java -version'. There are some calls to
SYS_223, which I am guessing is the bproc syscall. I am going to
attempt to decipher those calls a bit, but I would appreciate any help
you may have on this one.

Thanks!
Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
From: Nicholas H. <he...@se...> - 2003-07-31 02:42:10
Ok -- I swear this is the last 'whiny' email from me for a while. I
have noticed that the /proc/self/fd/3 hack for getting shell scripts
to run on the nodes tends to bite things like java and matlab that use
$0 to find the path to the installation. Would it be possible to set
argv[0] to the original path, in hopes that the paths would be found?
I am just curious as to the things that would prevent this -- I can
cook up a patch for this if it should work.

Thanks!
Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
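As a sanity check of the idea: execv() lets the caller pass any argv[0]
it likes, independent of the path actually executed, which is all such
a patch would need. A toy illustration (the paths are placeholders, and
fd 3 must already be open on an executable for the exec to succeed):

  /* Toy: exec the node-side copy via /proc/self/fd/3 while
     advertising the original installation path as argv[0], so
     $0-based lookups in the launched script resolve correctly. */
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      char *argv[] = { "/usr/java/j2sdk1.4.1_01/bin/java", NULL };
      execv("/proc/self/fd/3", argv);
      perror("execv");   /* reached only if the exec failed */
      return 1;
  }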
From: Nicholas H. <he...@se...> - 2003-07-31 02:35:47
On Tue, 2003-07-29 at 10:56, er...@he... wrote:
> On Tue, Jul 29, 2003 at 09:50:53AM -0400, Nicholas Henke wrote:
> > Hey Erik~
> > How are things on your end? Pretty well here.
> > I am attempting to use proc_pid_map to see all of the processes for
> > a user on a remote node, so that I can use bpsh to kill them. Some
> > of these processes are running via ssh, so of course they are not
> > in the bproc pid space. Can you tell me where I am going wrong?
>
> I'm a little fuzzy on exactly what you're trying to do here. If you
> want to kill a process in the slave's process space from a process
> (kill) in the master's process space, that won't work. You can't send
> signals across process spaces. Even with process ID mapping turned
> off in /proc (proc_pid_map), the mapping still happens for system
> calls (fork, wait, kill).

*sigh* I really need to smack my crack dealer around -- it seems I am
not getting the good stuff anymore. Thanks for the explanation ;)

> The reason I have the option to turn off pid mapping is that it
> allows you to see what's going on on the node even if you can't
> directly fix it from within BProc. I've had situations where some bit
> of the node boot-up stuff was spinning, eating up 20% cpu for no
> apparent reason, and the only way to see that was turning off pid
> mapping.

Ok -- that makes much more sense.

> btw, in case it wasn't clear:
>
>   proc_pid_map == 2: Do mapping for all users.
>   proc_pid_map == 1: Do mapping for users but not for root.
>   proc_pid_map == 0: Do mapping for nobody.

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
From: <er...@he...> - 2003-07-29 15:42:52
On Tue, Jul 29, 2003 at 09:50:53AM -0400, Nicholas Henke wrote:
> Hey Erik~
> How are things on your end? Pretty well here.
> I am attempting to use proc_pid_map to see all of the processes for a
> user on a remote node, so that I can use bpsh to kill them. Some of
> these processes are running via ssh, so of course they are not in the
> bproc pid space. Can you tell me where I am going wrong?

I'm a little fuzzy on exactly what you're trying to do here. If you
want to kill a process in the slave's process space from a process
(kill) in the master's process space, that won't work. You can't send
signals across process spaces. Even with process ID mapping turned off
in /proc (proc_pid_map), the mapping still happens for system calls
(fork, wait, kill).

The reason I have the option to turn off pid mapping is that it allows
you to see what's going on on the node even if you can't directly fix
it from within BProc. I've had situations where some bit of the node
boot-up stuff was spinning, eating up 20% cpu for no apparent reason,
and the only way to see that was turning off pid mapping.

btw, in case it wasn't clear:

  proc_pid_map == 2: Do mapping for all users.
  proc_pid_map == 1: Do mapping for users but not for root.
  proc_pid_map == 0: Do mapping for nobody.

- Erik
From: Nicholas H. <he...@se...> - 2003-07-29 13:51:09
Hey Erik~

How are things on your end? Pretty well here.

I am attempting to use proc_pid_map to see all of the processes for a
user on a remote node, so that I can use bpsh to kill them. Some of
these processes are running via ssh, so of course they are not in the
bproc pid space. Can you tell me where I am going wrong?

  bpsh 0 -O /proc/sys/bproc/proc_pid_map echo 1

  [root@struggles root]# bpsh 0 ps -ef
  UID        PID  PPID  C STIME TTY          TIME CMD
  root         1     0  0 Jul15 ?        00:00:14 init
  root         2     1  0 Jul15 ?        00:00:00 [keventd]
  root         3     0  0 Jul15 ?        00:00:00 [ksoftirqd_CPU0]
  root         4     0  0 Jul15 ?        00:00:00 [ksoftirqd_CPU1]
  root         5     0  0 Jul15 ?        00:00:00 [kswapd]
  root         6     0  0 Jul15 ?        00:00:00 [bdflush]
  root         7     0  0 Jul15 ?        00:00:01 [kupdated]
  root         8     1  0 Jul15 ?        00:00:00 [mdrecoveryd]
  root        14     1  0 Jul15 ?        00:00:00 [scsi_eh_0]
  root        15     1  0 Jul15 ?        00:00:00 [scsi_eh_1]
  root        18     1  0 Jul15 ?        00:00:00 [kjournald]
  root       337     1  0 Jul15 ?        00:00:00 /sbin/dhcpcd -n eth0
  root       416     1  0 Jul15 ?        00:00:01 syslogd -m 0
  root       421     1  0 Jul15 ?        00:00:00 klogd -2
  rpc        431     1  0 Jul15 ?        00:00:00 portmap
  rpcuser    452     1  0 Jul15 ?        00:00:00 rpc.statd
  root       496     1  0 Jul15 ?        00:00:00 [rpciod]
  root       497     1  0 Jul15 ?        00:00:00 [lockd]
  root       525     1  0 Jul15 ?        00:00:00 /usr/sbin/sshd
  root       551     1  0 Jul15 ?        00:00:37 sendmail: accepting connections
  root       577     1  0 Jul15 ?        00:00:00 gpm -t ps/2 -m /dev/mouse
  root       587     1  0 Jul15 ?        00:00:01 crond
  daemon     607     1  0 Jul15 ?        00:00:00 /usr/sbin/atd
  root       614     1  0 Jul15 tty1     00:00:00 /sbin/mingetty tty1
  root       615     1  0 Jul15 tty2     00:00:00 /sbin/mingetty tty2
  root       616     1  0 Jul15 tty3     00:00:00 /sbin/mingetty tty3
  root       617     1  0 Jul15 tty4     00:00:00 /sbin/mingetty tty4
  root       618     1  0 Jul15 tty5     00:00:00 /sbin/mingetty tty5
  root       619     1  0 Jul15 tty6     00:00:00 /sbin/mingetty tty6
  root      5561     1  0 Jul21 ?        00:00:05 /usr/sbin/bpslave -r strugglesi
  root      5562  5561  0 Jul21 ?        00:00:00 /usr/sbin/bpslave -r strugglesi
  root      5570  5561  0 Jul21 ?        00:00:06 mond
  henken    8759  5561  0 10:23 ?        00:00:00 sleep 3600
  root      8766   525  0 09:44 ?        00:00:00 /usr/sbin/sshd
  root      8767  8766  0 09:44 pts/0    00:00:00 -bash
  root      8819  5561  0 10:24 ?        00:00:00 ps -ef

  [root@struggles root]# bpsh 0 kill -9 8759
  kill 8759: No such process

Thanks!
Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
From: Brian B. <brb...@la...> - 2003-07-28 01:53:48
On Thursday, July 24, 2003, at 03:46 PM, Gregory Shakhnarovich wrote:
> 2) When I try to run LAM/mpirun, it only manages to run executables
> which are available on the nodes. So, since in our configuration my
> home directory is not mounted on the nodes (they are of course
> mounted on the head node), I have to move the executable somewhere to
> make it accessible to mpirun.
<snip>
> Is there a solution to this, without mounting the home directories on
> the nodes?

I can't add much useful about your Java problem, but I can comment on
your problems with LAM :).

LAM/MPI does not automatically grab binaries from the head node to the
client nodes. This is one of the current restrictions in the LAM port
to the BProc environment, and will probably be a restriction for at
least the life span of the 7.0 release series.

It is possible to have LAM push the binaries out to the LAM session
directory and have them execute there (note that this is very
different than the BProc binary push). This can be accomplished using
the -s option to mpirun:

  mpirun -s n0 <app>

This functionality uses UDP point-to-point communication and does not
scale particularly well. However, it will work and may be the only
viable alternative you have to mounting a common filesystem on your
cluster.

Last I looked at the BProc interface, it was going to be difficult to
implement the automatic migration of user binaries out to the compute
nodes. For various reasons, the processes started on the nodes must be
started by the LAM daemon on that node. Unless the situation has
improved, this means forking the LAM daemon, migrating to the head
node, execing, and migrating back to wherever I came from. This was a
bit too much to chew off for the first stable release of LAM's BProc
support :). It is possible that there is a better way to do this --
I'd be happy to hear any advice from the BProc experts out there....

Hope this helps,

Brian
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/
From: Nicholas H. <he...@se...> - 2003-07-26 13:04:38
On Thu, 2003-07-24 at 18:46, Gregory Shakhnarovich wrote:
> Hi,
>
> I am hoping someone has faced similar problems and knows a solution.
> A bit of info: we run a 2.4.20-bproc Debian kernel, with Clubmask, on
> a 32 node cluster.
>
> 1) Despite the fact that /etc/beowulf/config says, among others,
>
>   libraries /usr/lib/j2se/1.4/jre/lib/i386/libja*
>
> when I try to run Java, I get:
>
>   [378] bourbaki:/root>bpsh 0 java --version
>   Error: could not find libjava.so
>   Error: could not find Java 2 Runtime Environment.
>
> We have fixed similar problems with other libraries by adding them to
> 'config' and rebooting. But not this one. Anything special about it?

The java 'binary' is actually a shell script that uses $0 to find the
path to the java installation. Since bproc does a kinda funky shell
hack, this name does not get preserved; in fact it gets replaced with
'/proc/self/fd/3', and of course there are no java libraries in that
directory tree.

I am attaching two files, a shell script and a C program. The shell
script you can use like:

  bpsh 0 cmjava.sh /path/to/java arg1 arg2 ...

The C program gets compiled:

  gcc -o cmjava javawrapper.c -lbproc

and run:

  cmjava 0 /path/to/java arg1 arg2 ...

Let me know if either of these works for you -- and if not, send a ton
of debugging information. If these end up working for your java needs,
we can include them in Clubmask.

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
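The attachments themselves are not preserved in the archive. A
hypothetical reconstruction of what a javawrapper.c along these lines
might look like -- assuming libbproc's bproc_execmove(node, cmd, argv,
envp) and the <sys/bproc.h> header implied by the -lbproc link flag,
and not the original file -- is:

  /* Hypothetical cmjava-style wrapper (not the original attachment):
     move to the node and exec java with argv[0] left as the real
     installation path, so $0-based library lookups work. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/bproc.h>   /* header name assumed from -lbproc usage */

  extern char **environ;

  int main(int argc, char **argv)
  {
      if (argc < 3) {
          fprintf(stderr, "usage: %s <node> /path/to/java [args...]\n",
                  argv[0]);
          return 1;
      }
      int node = atoi(argv[1]);
      /* argv[2] serves as both the path to exec and the argv[0] the
         JVM sees on the node. */
      bproc_execmove(node, argv[2], &argv[2], environ);
      perror("bproc_execmove");   /* reached only on failure */
      return 1;
  }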
From: Michael M. <mm...@as...> - 2003-07-24 23:22:31
Gregory Shakhnarovich wrote:
> 2) When I try to run LAM/mpirun, it only manages to run executables
> which are available on the nodes. So, since in our configuration my
> home directory is not mounted on the nodes (they are of course
> mounted on the head node), I have to move the executable somewhere to
> make it accessible to mpirun.
>
> I am experimenting with Nic's cpi.sh, from the Clubmask User Guide; I
> think I tried every combination of working directories and paths in
> mpirun, but unless the program is visible to the nodes, I get:
>
>   mpirun: cannot start ./cpi on n1: No such file or directory
>
> I do not have the same problem with non-MPI executables, which can be
> happily started through bpsh from the head node, even though the
> nodes don't see them.
>
> Is there a solution to this, without mounting the home directories on
> the nodes?

Greg,

You can instruct mpirun to copy the binary to the compute nodes. This
is what I use in cpi.sh:

  mpirun c1-$NUMNODES -s h ./cpi

Mike
From: Gregory S. <gr...@ai...> - 2003-07-24 22:46:17
Hi,

I am hoping someone has faced similar problems and knows a solution. A
bit of info: we run a 2.4.20-bproc Debian kernel, with Clubmask, on a
32 node cluster.

1) Despite the fact that /etc/beowulf/config says, among others,

  libraries /usr/lib/j2se/1.4/jre/lib/i386/libja*

when I try to run Java, I get:

  [378] bourbaki:/root>bpsh 0 java --version
  Error: could not find libjava.so
  Error: could not find Java 2 Runtime Environment.

We have fixed similar problems with other libraries by adding them to
'config' and rebooting. But not this one. Anything special about it?

2) When I try to run LAM/mpirun, it only manages to run executables
which are available on the nodes. So, since in our configuration my
home directory is not mounted on the nodes (they are of course mounted
on the head node), I have to move the executable somewhere to make it
accessible to mpirun.

I am experimenting with Nic's cpi.sh, from the Clubmask User Guide; I
think I tried every combination of working directories and paths in
mpirun, but unless the program is visible to the nodes, I get:

  mpirun: cannot start ./cpi on n1: No such file or directory

I do not have the same problem with non-MPI executables, which can be
happily started through bpsh from the head node, even though the nodes
don't see them.

Is there a solution to this, without mounting the home directories on
the nodes?

Thanks,
--
Greg Shakhnarovich
AI Lab, MIT                       NE43-V611
Cambridge, MA 02139
tel (617) 253-8170   fax (617) 258-6287
From: Pirabhu R. <pir...@mp...> - 2003-07-17 19:26:11
I have a question regarding bproc_move(). I tried using bproc_move()
in one of my complex programs, and the program dies every time at the
call to bproc_move. The function does not return from the call, nor
are any error messages printed. The same bproc_move() runs fine when
used in one of my simple test programs. Could this be because of some
library dependencies in my earlier program? Could someone list the
possible cases in which bproc_move can fail? Any help is appreciated.

Thanks,
Pirabhu
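For whoever hits this next, a checked call at least separates "returns
an error" from "dies inside the call". This is a sketch, assuming
libbproc's int bproc_move(int node) returns nonzero with errno set on
failure:

  /* Sketch: bproc_move() with explicit error reporting. */
  #include <errno.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/bproc.h>   /* header name assumed from libbproc */

  int move_checked(int node)
  {
      errno = 0;
      if (bproc_move(node) != 0) {
          fprintf(stderr, "bproc_move(%d): %s\n", node, strerror(errno));
          return -1;
      }
      /* From here on, the process is executing on the slave node. */
      return 0;
  }

If the process vanishes inside the call rather than returning, running
the simple and complex cases under strace and comparing how far each
gets may point at the difference, with missing library dependencies on
the node being one plausible suspect, as suggested above.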
From: Michael M. <mm...@as...> - 2003-07-17 15:49:17
Sundaram A wrote:
> Dear All,
>
> I am very new to beowulf clustering technologies. I am having a
> problem compiling the sources downloaded from
> http://bproc.sourceforge.net . When I try to compile it with my
> kernel, it gives error 2. If anyone has come across this difficulty,
> please help me.
>
> Thanks & Best Regards
> Sundar

I would need to know which package you are compiling (bproc, kernel,
beoboot, etc...) as well as the complete error output.

Mike
From: Sundaram A <su...@nu...> - 2003-07-17 01:04:09
Dear All,

I am very new to beowulf clustering technologies. I am having a
problem compiling the sources downloaded from
http://bproc.sourceforge.net . When I try to compile it with my
kernel, it gives error 2. If anyone has come across this difficulty,
please help me.

Thanks & Best Regards
Sundar

"This email is confidential and may be privileged. If you are not the
intended recipient, please delete it and notify us immediately; you
should not copy or use it for any purpose, nor disclose its contents
to any other person. Thank you."
From: <er...@he...> - 2003-07-16 19:30:43
On Wed, Jul 16, 2003 at 11:37:26AM -0400, gor...@ph... wrote:
> Hi Erik,
>
> How's everything? I hope they're keeping you busy at Los Alamos.

Heh, I wouldn't worry about keeping me busy. Too busy, really.

> I wrote to ask if you had an ETA for the 2.4.21 patch? I have to add
> support for bproc on some cluster hardware which uses Promise IDE
> controllers, and I have to decide whether to try to pull the needed
> IDE patches out of RedHat's 2.4.20 kernels and stick them into the
> stock 2.4.20 kernel, or whether to wait for the 2.4.21 kernel, which
> already has all the IDE stuff I need. Thanks, in advance, for the
> advice.

I have a patch attached. I think it should be fine. I haven't released
it yet since I've been away, caught up working on other stuff and
hunting a remaining bug. This should be as solid with BProc 3.2.5 as
anything else, though.

Apply the kernel patch as usual and make this one little change in
bproc/kernel/interface.c. Change:

  inode = get_empty_inode();

to:

  inode = new_inode(bprocfs_mnt->mnt_sb);

- Erik
From: Michael M. <mm...@as...> - 2003-07-15 16:16:43
The gethostbyaddr function implemented by beonss returns a NULL
pointer in the h_aliases member of the hostent structure. I believe it
should be returning a pointer to a NULL-terminated array of alias
strings. If there are no aliases, then the array only contains the
NULL.

Mike
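For reference, this is how a standards-conforming caller walks
h_aliases, which is why a bare NULL pointer there (instead of a
NULL-terminated array) breaks ordinary consumers. A sketch using the
standard resolver API:

  /* Walk h_aliases the way callers expect to: h_aliases is a
     NULL-terminated array; "no aliases" means aliases[0] == NULL,
     not h_aliases == NULL. */
  #include <arpa/inet.h>
  #include <netdb.h>
  #include <stdio.h>

  static void print_aliases(const char *dotted)
  {
      struct in_addr addr;
      if (!inet_aton(dotted, &addr))
          return;

      struct hostent *he = gethostbyaddr((const void *)&addr,
                                         sizeof(addr), AF_INET);
      if (!he)
          return;

      /* Crashes here if the resolver hands back h_aliases == NULL. */
      for (char **a = he->h_aliases; *a != NULL; a++)
          printf("alias: %s\n", *a);
  }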
From: Nicholas H. <he...@se...> - 2003-07-10 19:26:53
On Wed, 2003-07-09 at 13:52, er...@he... wrote:
>
> Well, that's encouraging at least. There's definitely some bogosity
> fixed by that patch but I guess there's more.
>
> Are you still getting processes which are unkillable w/ signal 9? The
> processes that you had to go and kill on the node, they were gone
> from the master's process tree, right?

Nope -- it gets reparented to init, but its calling process is still
waiting for it, and I can kill -9 it from the head node, after which
its calling process (the 'sh') exits.

> Here's another thing to try as a diagnostic. Comment out this line in
> daemons/master.c
>
>   do_parent_exit(req);

Tried that -- it would appear that 6 of the 7 nodes hung almost at the
same time; I could not tell if it was _exactly_ the same time. Also --
just as an observation, all of the pids were in the range of 400 to
700 -- could this be a pid wraparound problem?

Also -- I have noticed that after doing a kill -9 on the processes,
sometimes the bpslave process will get nuked as well, leaving just one
bpslave process on the node. Restarting bpslave allows the node to
come back up, and bpsh to that node then works fine.

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
From: Michael M. <mm...@as...> - 2003-07-10 18:52:13
Hi,

It seems like the gethostbyname provided by the version of beonss
packaged with Clustermatic 3 does not work. For example, this python
code:

  import sys
  import bproc, socket

  nodes = sys.argv[1:]
  for node in nodes:
      print node
      print socket.gethostbyname(node)
      print bproc.nodenumber(socket.gethostbyname(node))

fails like this:

  [root@asl156 cluster]# python test.py n0
  n0
  Traceback (most recent call last):
    File "test.py", line 6, in ?
      print socket.gethostbyname(node)
  socket.gaierror: (-2, 'Name or service not known')

I upgraded beonss to 1.0.20 and that fixed my problem.

Mike
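To separate a beonss problem from anything Python-specific, the same
lookup can be driven straight through the C resolver. This is a
sketch; the node name "n0" follows the thread's naming convention:

  /* Query the resolver directly for a node name; with a working
     beonss this should print the node's address. */
  #include <arpa/inet.h>
  #include <netdb.h>
  #include <stdio.h>

  int main(void)
  {
      struct hostent *he = gethostbyname("n0");
      if (!he) {
          herror("gethostbyname(n0)");   /* prints the h_errno reason */
          return 1;
      }
      printf("n0 -> %s\n",
             inet_ntoa(*(struct in_addr *)he->h_addr_list[0]));
      return 0;
  }

If this fails the same way, the bug is below Python, in beonss's NSS
module.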
From: Nicholas H. <he...@se...> - 2003-07-09 18:21:06
On Wed, 2003-07-09 at 13:52, er...@he... wrote:
>
> Well, that's encouraging at least. There's definitely some bogosity
> fixed by that patch but I guess there's more.

Yeah -- after a while, all of them are hung again.

> Are you still getting processes which are unkillable w/ signal 9? The
> processes that you had to go and kill on the node, they were gone
> from the master's process tree, right?

Nope -- see attached.

> Here's another thing to try as a diagnostic. Comment out this line in
> daemons/master.c
>
>   do_parent_exit(req);
>
> I think there might be something sketchy going on with that code,
> although I don't know exactly what. Removing the "parent exit" stuff
> will have some implications for correctness - getppid() might return
> the wrong answer if your parent was on another node and it has
> exited. It *shouldn't* have any implications beyond that, however.
>
> I need to try to replicate this problem again.

I will see what this does....

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
From: <er...@he...> - 2003-07-09 18:03:44
On Wed, Jul 09, 2003 at 01:41:11PM -0400, Nicholas Henke wrote:
> On Mon, 2003-07-07 at 15:20, er...@he... wrote:
> > I have a hunch about what might be going on here. There's some
> > potential for badness in exit_notify with BProc. kill_pg and
> > is_orphaned_pgrp might end up setting the process state back to
> > RUNNING instead of ZOMBIE. Then they could get hung up because the
> > ghost is gone because it's already exited.
> >
> > I've attached a revised patch which I think should fix that. Can
> > you try it and see if it helps?
>
> It seems to have helped, but not solved the problem. It seems like
> more of the processes are running and not getting hung, but there
> were a few that did hang. I was able to do a 'top->bottom' kill -9
> with a 'sleep 1' between, and in one case it worked, but in another I
> had to go to the node again and kill -9 the process there.

Well, that's encouraging at least. There's definitely some bogosity
fixed by that patch, but I guess there's more.

Are you still getting processes which are unkillable w/ signal 9? The
processes that you had to go and kill on the node, they were gone from
the master's process tree, right?

Here's another thing to try as a diagnostic. Comment out this line in
daemons/master.c:

  do_parent_exit(req);

I think there might be something sketchy going on with that code,
although I don't know exactly what. Removing the "parent exit" stuff
will have some implications for correctness - getppid() might return
the wrong answer if your parent was on another node and it has exited.
It *shouldn't* have any implications beyond that, however.

I need to try to replicate this problem again.

- Erik
From: Nicholas H. <he...@se...> - 2003-07-09 17:41:30
On Mon, 2003-07-07 at 15:20, er...@he... wrote:
> I have a hunch about what might be going on here. There's some
> potential for badness in exit_notify with BProc. kill_pg and
> is_orphaned_pgrp might end up setting the process state back to
> RUNNING instead of ZOMBIE. Then they could get hung up because the
> ghost is gone because it's already exited.
>
> I've attached a revised patch which I think should fix that. Can you
> try it and see if it helps?

It seems to have helped, but not solved the problem. It seems like
more of the processes are running and not getting hung, but there were
a few that did hang. I was able to do a 'top->bottom' kill -9 with a
'sleep 1' between, and in one case it worked, but in another I had to
go to the node again and kill -9 the process there.

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania