From: Aaron M. <aa...@cs...> - 2002-06-25 05:40:25

I'm in the process of configuring bproc on 5 Athlon 650s. Currently I have one machine working properly (it can run a master and a slave and move processes) and one almost working. The problem is that when starting the bpslave program, immediately after bpmaster (running with -dv) shows a connection, the slave crashes. The only difference between the systems at the moment is a two sub-version difference in glibc (neither is currently running the patched version). Do I need to upgrade glibc (which would require updating most of RH), or is there something else?

aaron
--
_______________________________________________________
Aaron Macks (aa...@cs...)
My sheep has seven gall bladders, that makes me the King of the Universe!
From: Larry B. <ba...@us...> - 2002-06-24 18:36:46

The web manual for BProc (http://bproc.sourceforge.net/x86.html) talks about a modified dynamic linker that fetches libraries not found locally. I assume this does not help find missing libraries during compilation, only missing libraries at run-time. I modified Intel's make_exe makefile to compile on the master and run the BLAS tests on slave 0, which is how I plan to use the cluster anyway. This worked.

FYI: execution times from my (uniprocessor) Fortran finite-difference code:

Compaq Alpha DS20E 667 MHz, Tru64 Unix 5.1, Compaq Fortran "f90 -fast": 4.93 seconds.
Intel P4 1.8A w/ VIA chipset and DDR2100, WNT4/WS, Microsoft Fortran PowerStation "/Zi /G5": 5.14 seconds.
Red Hat Linux, g77 "-O3 -funroll-loops": 5.34 seconds.
Red Hat Linux, Intel Fortran (default options): 4.45 seconds.
Red Hat Linux, Intel Fortran "-O2 -ip -pad -tpp7 -xW": 4.14 seconds.
Red Hat Linux, Intel Fortran "-O3 -ip -pad -tpp7 -xW": 4.15 seconds.
Apple G4 867 MHz, OS 9, Absoft Fortran "-O": 13.89 seconds.

We have yet to test any code that uses the Intel BLAS library routines.

Larry Baker
US Geological Survey
From: J.A. M. <jam...@ab...> - 2002-06-22 13:09:51

On 2002.06.22 Erik Arjan Hendriks wrote:
> On Thu, Jun 20, 2002 at 12:29:20AM +0200, J.A. Magallon wrote:
> > Hi, bproc users and developers.
> >
> > I think I have found a weird bug... if it is so, and I am not misunderstanding something.
>
> This is weird. Can you send me the snippet of test code so that I can reproduce it? It seems like the sort of thing that should be easy to hunt down if I can reproduce it.

Program and Makefile are attached. Defining DO_RFORK uses rfork(), and without it the program does a fork()-move(). There is a long sleep in pslave() to allow time to see the process structure with pstree -p.

Thanks, I hope you can figure out what is wrong (or confirm I'm making a _big_ mistake).

--
J.A. Magallon \ Software is like sex: It's better when it's free
mailto:jam...@ab... \ -- Linus Torvalds, FSF T-shirt
Linux werewolf 2.4.19-pre10-jam3, Mandrake Linux 8.3 (Cooker) for i586
gcc (GCC) 3.1.1 (Mandrake Linux 8.3 3.1.1-0.5mdk)
From: Erik A. H. <er...@he...> - 2002-06-22 01:44:21

On Thu, Jun 20, 2002 at 12:29:20AM +0200, J.A. Magallon wrote:
> Hi, bproc users and developers.
>
> I think I have found a weird bug... if it is so, and I am not misunderstanding something.

This is weird. Can you send me the snippet of test code so that I can reproduce it? It seems like the sort of thing that should be easy to hunt down if I can reproduce it.

- Erik

P.S. Sorry for the slow response, I've been traveling for the last week :)
From: Larry B. <ba...@us...> - 2002-06-22 01:06:06

I am a newcomer to Linux and Linux clusters, and I know just enough system admin commands to get a workstation up and running on our network. I'm more interested in computing platforms to run our Fortran and C programs faster, which is how I found Clustermatic.

I installed Clustermatic to experiment with a small Beowulf cluster. My master node is an old 233 MHz PII with 64MB. I installed a 100GB disk and partitioned it with 4GB for / (includes /boot, /usr, and /opt), 1GB for /var, 2GB for swap, and the rest (~90 GB) for /home. I have 2 slave nodes: each is a 1.8 GHz P4-A with 1GB DDR. The slaves boot from floppy and have no local hard disk (i.e., no swap); all disk access is to NFS exports from the master (/bin->/bin(ro), /home->/home(rw), /opt->/opt(ro), /sbin->/sbin(ro), /usr->/usr(ro), and /var->/var/node.#(rw) in /etc/beowulf/fstab, and I rmdir then soft-link /tmp->/var/tmp and /scratch->/home/node.# in /etc/beowulf/node_up).

I installed the Intel C++ and Fortran compilers on the master and successfully ran one of my (uniprocessor) benchmarks. I also installed Intel's Math Kernel Library. This is where I ran into trouble. Intel's BLAS test suite runs fine on the master (using the default Pentium Pro target architecture). But, when I try to run the same test suite on one of the slaves, I get a compilation error:

ld: warning: libpthread.so.0, needed by /opt/intel/mkl//lib/32/libguide.so, not found (try using -rpath or -rpath-link)

I looked for /lib/libpthread.so.0 on the master node using "ls /lib", and it's there. But when I looked for it on slave node 0 (or 1) using "bpsh 0 ls /lib", it's not there. "/usr/sbin/bplib -l" shows /lib/libpthread.so.0 in the library list, but it apparently is not accessible.

My /etc/beowulf/config says that the libraries in /lib and /usr/lib are supposed to be "automagically" made available on the slaves:

libraries /lib /usr/lib

But it doesn't explain how. I don't know why some libraries are in the /lib and /usr/lib directories on the slaves and others are not. I found /usr/lib/beoboot/bin/setup_libs, which looks like it copies libraries to the slaves, but it is commented out in /usr/lib/beoboot/bin/node_up (is it obsolete?). The beoboot kit in the Clustermatic tarballs directory has an rc.beowulf which looks like it does similar things, but I could not find an rc.beowulf anywhere else on the master node, so I don't think it is called either.

Could someone please explain how and which libraries get copied to the slave nodes, and what I have to do to get libpthread.so.0 (and maybe more later) included? Are all the libraries in the bplib list supposed to be there? Is there a problem if the bplib list does not match the libraries on the slaves? Could I have run out of RAM disk? (Which brings up another question: how do I tell how much RAM disk each slave has and how full it is?)

Thanks in advance for your help,

Larry Baker
US Geological Survey

P.S. My benchmark ran 16% faster on my 1.8A P4 with PC2100 DDR (VIA chipset) than on my 667 MHz Tru64 Unix Alpha DS20E.
From: J.A. M. <jam...@ab...> - 2002-06-19 22:29:38

Hi, bproc users and developers.

I think I have found a weird bug... if it is so, and I am not misunderstanding something.

I have a bunch of dual boxes in a bproc cluster. As MPI and so on do not know anything about sharing memory with local siblings (so a dual box with 1Gb of RAM really allows you to use 512 Mb per process), I want to do it manually. I try to spawn node slaves via bproc and on each node spawn POSIX threads. So I did something like:

for each node on cluster
    switch (bproc_rfork(node_i))
    // Slave
    0:   do 2 times: pthread_create
         wait for threads
         return
    // Master
    pid: store pid to wait() on it
wait for all pids

It does not work. The slave gets hung on the first pthread_create; it does not return from the function call (at least it launches the thread's target function). pstree shows this:

process on    process     thread       child
master        on slave    controller   thread
bash(13526)---pt(13991)---pt(13992)---pt(13993)---pt(13994)

But, just to test, I tried:

for each node on cluster
    switch (fork())            // Local fork on master
    // Slave
    0:   bproc_move(node_i)
         do 2 times: pthread_create
         wait for threads
         return
    // Master
    pid: store pid to wait() on it
wait for all pids

AND IT _WORKS_!! This is the output of pstree:

process on    process     thread       child
master        on slave    controller   threads
bash(13526)---pt(13946)---pt(13947)---pt(13948)-+-pt(13949)
                                                `-pt(13950)

So rfork is doing something different from a fork+move?? Why does it not work directly? Any ideas?

--
J.A. Magallon \ Software is like sex: It's better when it's free
mailto:jam...@ab... \ -- Linus Torvalds, FSF T-shirt
Linux werewolf 2.4.19-pre10-jam3, Mandrake Linux 8.3 (Cooker) for i586
gcc (GCC) 3.1.1 (Mandrake Linux 8.3 3.1.1-0.4mdk)
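[For reference, the pseudocode above can be written as a self-contained C sketch using the bproc_rfork()/bproc_move()/bproc_currnode() calls from <sys/bproc.h> that appear elsewhere in this thread. The per-node loop and most error handling are trimmed, and the node number and function names are only illustrative; this restates the reported behaviour, it is not a fix.]

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/wait.h>
#include <sys/bproc.h>

static void *worker(void *arg)
{
    printf("thread %ld running on node %d\n", (long)arg, bproc_currnode());
    return 0;
}

static void run_threads(void)
{
    pthread_t tid[2];
    long i;

    for (i = 0; i < 2; i++)
        pthread_create(&tid[i], 0, worker, (void *)i);  /* reportedly hangs here after rfork */
    for (i = 0; i < 2; i++)
        pthread_join(tid[i], 0);
}

int main(void)
{
    int node = 0;   /* example target node */

#ifdef DO_RFORK
    /* Pattern 1: bproc_rfork() straight onto the node (reported NOT to work). */
    switch (bproc_rfork(node)) {
    case -1: perror("bproc_rfork"); exit(1);
    case 0:  run_threads(); return 0;      /* child, now running on the slave */
    default: wait(0);                      /* parent, still on the master */
    }
#else
    /* Pattern 2: plain fork() on the master, then bproc_move() (reported to work). */
    switch (fork()) {
    case -1: perror("fork"); exit(1);
    case 0:  bproc_move(node); run_threads(); return 0;
    default: wait(0);
    }
#endif
    return 0;
}

[Building would presumably be something like "gcc -DDO_RFORK test.c -lbproc -lpthread"; the file name is made up here.]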
From: J.A. M. <jam...@ab...> - 2002-06-15 00:58:13

Hi all.

I write this here to see if you can give me some clue on a strange behaviour of the clone system call. Really I want a hint about whether it is a kernel issue, a glibc (pthreads) issue or a gcc issue. If it sometimes works it looks like it is not a kernel issue, but who knows...

Abstract: compiling a program with 'gcc -pthread', or just linking with -lpthread (even if I do not call any pthread_xxx function), makes the 'clone()' call (glibc) fail. It does not jump to the given function. Checked on both:

gcc (GCC) 3.1.1 (Mandrake Linux 8.3 3.1.1-0.4mdk)
gcc version 2.96 20000731 (Mandrake Linux 8.2 2.96-0.76mdk)

Sample test:

#include <sched.h>
#include <signal.h>
#include <stdio.h>

#define STSZ (4*1024)

int pslave(void *data);

int main(int argc, char **argv)
{
    int   tid;
    char *stack;

    stack = (char *)valloc(STSZ);
    puts("about to clone...");
    tid = clone(pslave, stack+STSZ-1, CLONE_VM|SIGCHLD, 0);
    if (tid < 0) {
        perror("clone");
        exit(1);
    }
    puts("clone ok");
    wait(0);
    free(stack);
    return 0;
}

int pslave(void *data)
{
    puts("slave running");
    sleep(1);
    puts("slave done");
    return 0;
}

--
J.A. Magallon \ Software is like sex: It's better when it's free
mailto:jam...@ab... \ -- Linus Torvalds, FSF T-shirt
Linux werewolf 2.4.19-pre10-jam3, Mandrake Linux 8.3 (Cooker) for i586
gcc (GCC) 3.1.1 (Mandrake Linux 8.3 3.1.1-0.4mdk)
From: J.A. M. <jam...@ab...> - 2002-06-14 00:25:15

Hi.

Funny problem with a cluster of dual SMP boxes. I am trying to rfork a process from the front-end and make each remote process split in two with pthreads. The problem is that thread creation gets blocked after the first thread. Below is a test program.

The result when compiled with ONLY_LOCAL is like:

annwn:~/bp> bp
bproc version: 3.1.10
nodes: 1
node 0 id -1 has 4 CPUs        <============== run on master
about to create thread...
done
about to create thread...
done
about to create thread...
done
about to create thread...
done
thread 0 on node 0
thread 1 on node 0
thread 2 on node 0
thread 3 on node 0

Running the program on a node via bpsh:

annwn:~/bp> bpsh 1 bp
bproc version: 3.1.10
nodes: 1
node 0 id 1 has 2 CPUs
about to create thread...
thread 0 on node 0
done
about to create thread...
thread 1 on node 0
done

(So you _can_ create threads after a move.)

And building without ONLY_LOCAL, and running on the master:

annwn:~/bp> bp
bproc version: 3.1.10
nodes: 1
node 0 id 0 has 2 CPUs
about to create thread...
thread 0 on node 0
<IT STOPS HERE FOREVER>

Any idea???

Here is the test program (bp.cc; yup, g++ to build):

#include <iostream>
#include <sys/sysinfo.h>
#include <sys/bproc.h>
#include <sys/wait.h>
#include <pthread.h>
#include <errno.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>

//#define ONLY_LOCAL

using namespace std;

int nnodes;
int nself;
void nslave();

int nprocs;
int self;
void* pslave(void *);

int main(int argc, char** argv)
{
    bproc_version_t info;

    if (bproc_version(&info)) {
        cout << "bproc_version: ";
        cout << strerror(errno) << endl;
        exit(1);
    }
    cout << "bproc version: " << info.version_string << endl;

    //nnodes = bproc_numnodes();
    nnodes = 1;
    cout << "nodes: " << nnodes << endl;

#ifndef ONLY_LOCAL
    int spawned = 0;
    for (int i=0; i<nnodes; i++) {
        nself = spawned;
        switch (bproc_rfork(i)) {
        case -1:
            cout << "bproc_fork on " << i << ": ";
            cout << strerror(errno) << endl;
            break;
        case 0:
            nslave();
            return 0;
            break;
        default:
            spawned++;
        }
    }
    for (int i=0; i<spawned; i++)
        wait(0);
#else
    nself = 0;
    nslave();
#endif

    return 0;
}

void nslave()
{
    nprocs = get_nprocs_conf();
    cout << "node " << nself << " id " << bproc_currnode();
    cout << " has " << nprocs << " CPUs" << endl;

    pthread_t tid[nprocs];
    int spawned = 0;
    for (int i=0; i<nprocs; i++) {
        cout << "about to create thread..." << endl;
        pthread_create(&tid[i], 0, pslave, (void*)spawned);
        cout << "done" << endl;
        spawned++;
    }
    for (int i=0; i<spawned; i++)
        pthread_join(tid[i], 0);
}

void* pslave(void *data)
{
    int tself = int(data);
    cout << "thread " << tself << " on node " << nself << endl;
    sleep(2);
    return 0;
}

--
J.A. Magallon \ Software is like sex: It's better when it's free
mailto:jam...@ab... \ -- Linus Torvalds, FSF T-shirt
Linux werewolf 2.4.19-pre10-jam3, Mandrake Linux 8.3 (Cooker) for i586
gcc (GCC) 3.1.1 (Mandrake Linux 8.3 3.1.1-0.4mdk)
From: Grant T. <gt...@sw...> - 2002-06-12 20:52:50

>>>>> Erik Arjan Hendriks <er...@he...> writes:
> I find it surprising that the cache stuff is so expensive. Is the
> hardware somewhat lacking in maintaining coherency between CPUs?

No, the whole cache is on die along with the cores. Coherency is "free" to some extent, since both cores share most of the cache. It's also "free" with respect to DMA, as many of the key peripherals are on the inside of the cache and not the outside.

What's not free is icache flushes, because in addition to being a little expensive by themselves, they occupy all cores by doing a nasty interprocessor function call. Doing this for every page is a lot of work.

I also ended up skipping the brute-force zero compare on the freezing side. In our application we don't have many private pages which are full of zeroes, so we come out a little bit ahead here.

--
Grant Taylor - x285 - http://pasta/~gtaylor/
Starent Networks - +1.978.851.1185
From: Erik A. H. <er...@he...> - 2002-06-12 20:33:46

On Thu, Jun 06, 2002 at 04:44:47PM -0400, Grant Taylor wrote:
> >>>>> Erik Arjan Hendriks <er...@he...> writes:
> >> - Removing seemingly unneeded icache flushes for non-executable pages
> >>   in load_map makes no difference, time-wise, on my platform. This
>
> > That's good. I don't find it surprising though since most of the
> > addresses involved in the flush aren't going to be in the icache
>
> OK, I lied. Evidently that experiment was done with the wrong kernel
> or something. In fact it's like 5X faster, as cache flushes are fairly
> expensive on my (SMP) platform. Profile below.
>
> So you definitely want to stick an if around the flush_icache_range
> call for the benefit of those of us with expensive IPI cache
> nastiness:
>
> if (mmap_prot & PROT_EXEC) {
>     flush_icache_range(page.start, page.start + PAGE_SIZE);
> }

Cool. Patch added.

> I also observe in my profile that the read's memcpy ("both_aligned" in
> my profile) and the fault's memzero ("sb1_clear_page") each take
> nearly half the time. I'm going to bodge together some way to skip
> the page clear for this case; we already know we're overwriting the
> whole page, so this should be OK as long as it gets carefully zeroed
> on error. It'll be pretty ugly, though, so I suspect you won't
> actually want it.
[snip]

I find it surprising that the cache stuff is so expensive. Is the hardware somewhat lacking in maintaining coherency between CPUs?

- Erik
From: Miguel C. <mc...@fc...> - 2002-06-10 17:35:32

Hello all.

I'm testing Clustermatic on a cluster that has been running Scyld for a while, and I would like to know if it is possible to compile Scyld's beomap and libbeostat with the new BProc.

The thing is that, as I mentioned, I've been using Scyld for a few months now, and although I'd like to move to Clustermatic (if nothing else because it seems to be a more friendly and up-to-date project), there are a few things in Scyld that make it easier for the end user. I'm referring to Beowulf Batch Queue and Beostatus (with their web output) and the fact that mpirun does load-balancing at run time. This makes it really easy for users to submit jobs - they just check the status of the cluster with a browser and then submit a job without worrying about where it will run. (The webmin modules for beosetup and Beowulf Batch Queue are also fun for the admin.)

All these seem to depend on Scyld's beomap and libbeostat - is it possible to compile them with the new BProc? Or maybe there are good alternatives for these tools?

Thanks again.

miguel costa
From: J.A. M. <jam...@ab...> - 2002-06-07 19:57:56

On 2002.06.07 David wrote:
> I am having some problems compiling bproc-3.1.10. I receive the following
> error:
>
> gcc -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer -D__powerpc__
> -fsigned-char -msoft-float -pipe -ffixed-r2 -Wno-uninitialized -mmultiple
> -mstring -D__KERNEL__ -DMODULE -DPACKAGE_VERSION='"3.1.10"'
> -DPACKAGE_MAGIC='21306' -DENABLE_DEBUG -DLINUX_TCP_IS_BROKEN -I. -I../vmadump
> -I/usr/src/linux/include -c ghost.c
>
> ghost.c:63: initializer element is not constant
> ghost.c:63: (near initialization for `bproc_ghost_reqs')
> ghost.c:63: initializer element is not constant
> ghost.c: In function `bproc_kernel_thread':
> ghost.c:279: warning: ignoring asm-specifier for non-static local variable
> `retval'
> make[1]: *** [ghost.o] Error 1
> make[1]: Leaving directory `/root/bproc-3.1.10/kernel'
> make: *** [kernel_] Error 2
>
> Any suggestions?

diff -ruN bproc-3.1.9/kernel/bproc.h bproc-3.1.9-j/kernel/bproc.h
--- bproc/kernel/bproc.h	2002-02-19 23:25:47.000000000 +0100
+++ bproc-j/kernel/bproc.h	2002-03-29 11:52:43.000000000 +0100
@@ -630,10 +630,12 @@
 #define BPROC_DEADREQ(r) ((r)->req.req == 0)
 #define BPROC_PENDING(r) ((!BPROC_DEADREQ(r))&&(!BPROC_ISRESPONSE((r)->req.req)))
 
-#define EMPTY_BPROC_REQUEST_QUEUE(foo) \
-    ((struct bproc_request_queue_t) {SPIN_LOCK_UNLOCKED,0, \
+#define EMPTY_BPROC_REQUEST_QUEUE_STATIC(foo) \
+    {SPIN_LOCK_UNLOCKED,0, \
     LIST_HEAD_INIT((foo).list),__WAIT_QUEUE_HEAD_INITIALIZER((foo).wait),\
-    LIST_HEAD_INIT((foo).pending)})
+    LIST_HEAD_INIT((foo).pending)}
+#define EMPTY_BPROC_REQUEST_QUEUE(foo) \
+    ((struct bproc_request_queue_t) EMPTY_BPROC_REQUEST_QUEUE_STATIC(foo))
 
 extern atomic_t msg_count;
 static inline
diff -ruN bproc-3.1.9/kernel/ghost.c bproc-3.1.9-j/kernel/ghost.c
--- bproc/kernel/ghost.c	2002-03-08 20:26:31.000000000 +0100
+++ bproc-j/kernel/ghost.c	2002-03-29 11:52:59.000000000 +0100
@@ -60,7 +60,7 @@
 DECLARE_WAIT_QUEUE_HEAD(ghost_wait);
 
 struct bproc_request_queue_t bproc_ghost_reqs =
-    EMPTY_BPROC_REQUEST_QUEUE(bproc_ghost_reqs);
+    EMPTY_BPROC_REQUEST_QUEUE_STATIC(bproc_ghost_reqs);
 
 int ghost_deliver_msg(pid_t pid, struct bproc_krequest_t *req)
 {

--
J.A. Magallon          # Let the source be with you...
mailto:jam...@ab...
Mandrake Linux release 8.3 (Cooker) for i586
Linux werewolf 2.4.19-pre10-jam2 #1 SMP vie jun 7 17:04:23 CEST 2002 i686
From: David <dg...@te...> - 2002-06-07 18:37:24

I am having some problems compiling bproc-3.1.10. I receive the following error:

gcc -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer -D__powerpc__ -fsigned-char -msoft-float -pipe -ffixed-r2 -Wno-uninitialized -mmultiple -mstring -D__KERNEL__ -DMODULE -DPACKAGE_VERSION='"3.1.10"' -DPACKAGE_MAGIC='21306' -DENABLE_DEBUG -DLINUX_TCP_IS_BROKEN -I. -I../vmadump -I/usr/src/linux/include -c ghost.c

ghost.c:63: initializer element is not constant
ghost.c:63: (near initialization for `bproc_ghost_reqs')
ghost.c:63: initializer element is not constant
ghost.c: In function `bproc_kernel_thread':
ghost.c:279: warning: ignoring asm-specifier for non-static local variable `retval'
make[1]: *** [ghost.o] Error 1
make[1]: Leaving directory `/root/bproc-3.1.10/kernel'
make: *** [kernel_] Error 2

Any suggestions?

David
From: Grant T. <gt...@sw...> - 2002-06-06 21:59:21

>>>>> Grant Taylor <gt...@sw...> writes:
> I'm going to bodge together some way to skip the page clear for this

OK, I've done that. Here's my vmadump performance diff. Alas, this makes vmadump something less than a generic module, but if someday you hit some application where vmadump performance is really important, here it is...

===== include/linux/mm.h 1.2 vs edited =====
--- 1.2/include/linux/mm.h	Thu Jan 10 16:40:15 2002
+++ edited/include/linux/mm.h	Thu Jun  6 17:40:33 2002
@@ -103,6 +103,8 @@
 #define VM_DONTEXPAND	0x00040000	/* Cannot expand with mremap() */
 #define VM_RESERVED	0x00080000	/* Don't unmap it from swap_out */
 
+#define VM_NOCLEAR	0x00100000	/* Don't clear when faulting in anonymous pages */
+
 #define VM_STACK_FLAGS	0x00000177
 
 #define VM_READHINTMASK			(VM_SEQ_READ | VM_RAND_READ)
===== kernel/vmadump.c 1.4 vs edited =====
--- 1.4/kernel/vmadump.c	Wed Mar  6 16:25:53 2002
+++ edited/kernel/vmadump.c	Thu Jun  6 17:45:08 2002
@@ -430,21 +430,42 @@
 	}
     }
 
+#define VM_SET_NOCLEAR_MAP_AT_ADDR(addr, noclear) { \
+	if (addr) { \
+	    struct vm_area_struct *map; \
+	    down_write(&current->mm->mmap_sem); \
+	    map = find_vma(current->mm, addr); \
+	    if (map) { \
+		if (noclear) { \
+		    map->vm_flags |= VM_NOCLEAR; \
+		} else { \
+		    map->vm_flags &= ~VM_NOCLEAR; \
+		} \
+	    } \
+	    up_write(&current->mm->mmap_sem); \
+	} \
+    }
+
     /* Read in patched pages */
+    VM_SET_NOCLEAR_MAP_AT_ADDR(addr, 1);/* don't clear page when faulting! */
     r = read_kern(ctx, file, &page, sizeof(page));
     while (r == sizeof(page) && page.start != ~0UL) {
 	r = read_user(ctx, file, (void *) page.start, PAGE_SIZE);
 	if (r != PAGE_SIZE) goto err;
-	flush_icache_range(page.start, page.start + PAGE_SIZE);
+	if (mmap_prot & PROT_EXEC) {
+	    flush_icache_range(page.start, page.start + PAGE_SIZE);
+	}
 	r = read_kern(ctx, file, &page, sizeof(page));
     }
     if (r != sizeof(page)) goto err;
+    VM_SET_NOCLEAR_MAP_AT_ADDR(addr, 0);
 
     if (k_mprotect(head->start,head->end - head->start, mmap_prot))
 	printk("vmadump: thaw: mprotect failed. (ignoring)\n");
     return 0;
 
  err:
+    VM_SET_NOCLEAR_MAP_AT_ADDR(addr, 0);
     if (r >= 0) r = -EIO;	/* map short reads to EIO */
     return r;
 }
===== mm/memory.c 1.2 vs edited =====
--- 1.2/mm/memory.c	Thu Jan 10 16:40:56 2002
+++ edited/mm/memory.c	Thu Jun  6 17:23:23 2002
@@ -1174,7 +1174,10 @@
 		page = alloc_page(GFP_HIGHUSER);
 		if (!page)
 			goto no_mem;
-		clear_user_highpage(page, addr);
+		if (!(vma->vm_flags & VM_NOCLEAR)) {
+			/* Clear (unless hackish noclear flag was set by ie vmadump) */
+			clear_user_highpage(page, addr);
+		}
 
 		spin_lock(&mm->page_table_lock);
 		if (!pte_none(*page_table)) {

--
Grant Taylor - x285 - http://pasta/~gtaylor/
Starent Networks - +1.978.851.1185
From: Grant T. <gt...@sw...> - 2002-06-06 20:45:00

>>>>> Erik Arjan Hendriks <er...@he...> writes:
>> - Removing seemingly unneeded icache flushes for non-executable pages
>>   in load_map makes no difference, time-wise, on my platform. This

> That's good. I don't find it surprising though since most of the
> addresses involved in the flush aren't going to be in the icache

OK, I lied. Evidently that experiment was done with the wrong kernel or something. In fact it's like 5X faster, as cache flushes are fairly expensive on my (SMP) platform. Profile below.

So you definitely want to stick an if around the flush_icache_range call for the benefit of those of us with expensive IPI cache nastiness:

if (mmap_prot & PROT_EXEC) {
    flush_icache_range(page.start, page.start + PAGE_SIZE);
}

I also observe in my profile that the read's memcpy ("both_aligned" in my profile) and the fault's memzero ("sb1_clear_page") each take nearly half the time. I'm going to bodge together some way to skip the page clear for this case; we already know we're overwriting the whole page, so this should be OK as long as it gets carefully zeroed on error. It'll be pretty ugly, though, so I suspect you won't actually want it.

With needless flushes:

  218 smp_call_function               0.5240
   34 both_aligned                    0.3148
   24 local_sb1_flush_icache_range    0.1200
   10 sb1_flush_icache_range_ipi      0.2500
    8 sb1_clear_page                  0.2000
    5 cleanup_both_aligned            0.0781
    4 kunmap_high                     0.0135
    4 do_anonymous_page               0.0069
    3 kmap_high                       0.0056
    2 zap_page_range                  0.0017
    2 sb1_copy_page                   0.0167
    2 pte_alloc                       0.0047
    2 do_shmem_file_read              0.0054

Without needless flushes:

   34 both_aligned                    0.3148
   12 sb1_clear_page                  0.3000
    7 cleanup_both_aligned            0.1094
    4 sb1_sanitize_tlb                0.0312
    2 kmap_high                       0.0037
    2 __free_pages_ok                 0.0017
    1 zap_page_range                  0.0009
    1 smp_call_function               0.0024
    1 shmem_getpage_locked            0.0007
    1 shmem_getpage                   0.0025
    1 shmem_file_read                 0.0078
    1 sb1_copy_page                   0.0083
    1 sb1___flush_cache_all_ipi       0.0078

--
Grant Taylor - x285 - http://pasta/~gtaylor/
Starent Networks - +1.978.851.1185
From: Dave S. <dav...@vi...> - 2002-06-06 17:27:09

OK - I'm a newbie here. Does anybody know of a document somewhere on how to set up Beowulf clients / disks? I have my nodes up and running with local disk swap and some empty filesystems - except for what is installed by node_up.

Thank you in advance,

ds
From: Grant T. <gt...@sw...> - 2002-06-05 00:08:44

>>>>> Erik Arjan Hendriks <er...@he...> writes:
> That's good. I don't find it surprising though since most of the
> addresses involved in the flush aren't going to be in the icache
> anyway. Flushing stuff out to main memory early shouldn't cause any
> problem either since you're going to end up forcing that to happen
> anyway.

Ah, this is true, I guess. At these sizes, it probably does stomp all over the cache plenty.

> Hrm. That's pretty slow. I'm seeing 550m+bits/sec on our alpha
> cluster here (IP over myrinet).

Hmm. We're using gig ethernet. The ethernet is an on-chip MAC with a series of mind-blowingly ugly hacks in the driver to make it deal with hardware bugs, so it is performance-limited to some extent. However, it's not limited to 100mbit by any stretch of the imagination; netperf goes at 500-800.

> Every new page is going to cause a page fault.

Yeah, this is what I figured.

> It could be that page faults are particularly slow or something.
> That certainly wouldn't surprise me.

Me either, certainly not on this particular chip. Is there any way to "bulk allocate" them into existence before doing the reads? Seems like there's got to be some way to save time here.

I'll go poke. Maybe there's some ugly way to speed faults for vmadump's special known-future case or for just my platform. Perhaps the profiler will suggest something.

>> I don't suppose there would be a way to map the pages directly

> No, not with the current dump file format. The basic problem is
> VMADump goes out of its way to send less than the whole process
> image. Therefore the dump file is a bunch of pages that aren't
> necessarily contiguous. There's a tiny bit of goop (page address)
> in front of every page in the file too.

Hmm. For us the heap is 95% of the thing, and it's pretty much all dirty. Too bad you've got a per-page header ;(

>> In the end I make the sender wait in a loop for the TCP_INFO
>> ioctl to return a TCP state of TCP_CLOSE. No way should this be

> Whoa. That's messed up. I've run into a number of TCP bugs including
> one that results in a spurious connection reset but nothing that
> matches that. VMADump certainly should not cause any problems since
> the only I/O related calls it makes are read and write.

Oh, yes, it happens pretty much the same for all-userspace code using temp files. I'm quite sure it's specific to us; if it generally worked like this your average httpd would totally not work.

> The reset bug I saw looks like this: If two machines are connected (A
> and B) and A calls shutdown(fd, 1) (i.e. no more sends from A, send
> TCP FIN), then B will sometimes get connection reset by peer on
> writes.

The whole Unix API is a little iffy about a lot of this stuff; often it's impossible for just a return value and errno to express what's going on. I've been working with O(1) poll replacements lately; boy, do some of these interact oddly with signals/tcp/etc.

--
Grant Taylor - x285 - http://pasta/~gtaylor/
Starent Networks - +1.978.851.1185
From: Erik A. H. <er...@he...> - 2002-06-04 23:24:54

On Tue, Jun 04, 2002 at 04:58:30PM -0400, Grant Taylor wrote:
> > On Thu, Mar 07, 2002 at 07:20:28PM -0500, Grant Taylor wrote:
> >> Anyway, it's a little frentic here now, but after things settle down
> >> I'll put together a clean patch for mips.
>
> Well, I haven't gotten to the clean patch point yet (I still have
> nasty hacks in syscall entry to make it work) but now we're at the
> performance tuning stage...
>
> I'm trying to make vmadump run faster. Currently it looks like the
> following things are true:
>
> - Freezing is faster than thawing by around a factor of 2.

That's interesting...

> - Having the dumping kernel connected directly to the undumping
>   kernel* pipelines the whole process and makes thawing the only wall
>   time spent. So that's a 33% speedup.

This is what BProc does, btw.

> - Having four CPUs migrate processes to four other CPUs all in
>   parallel pipelines that; so that's a 4X speedup.
>
> - Removing seemingly unneeded icache flushes for non-executable pages
>   in load_map makes no difference, time-wise, on my platform. This
>   is a little surprising given 250MB of cache flushing that went
>   away, but so be it.

That's good. I don't find it surprising though since most of the addresses involved in the flush aren't going to be in the icache anyway. Flushing stuff out to main memory early shouldn't cause any problem either since you're going to end up forcing that to happen anyway.

> - The network is not the bottleneck. My cluster will TCP within
>   itself at nearly a gigabit, and the dumps go at maybe 10MB/s.

Hrm. That's pretty slow. I'm seeing 550m+bits/sec on our alpha cluster here (IP over myrinet).

> This leaves me wondering why thawing is so much more expensive than
> freezing. It's curious that writing the arriving dump to a file in
> tmpfs is way faster than thawing. My assumption is that the extra
> time is sunk in the allocation of pages through mmap; this is the main
> difference between the various paths. Is each read of a page's data
> triggering a page fault, or are the pages allocated at mmap time? Is
> there something I can do to speed this up?

Every new page is going to cause a page fault. mmap really just sets up a memory region so that the fault handler knows what to do.

I don't have any experience with Linux on MIPS myself. It could be that page faults are particularly slow or something. That certainly wouldn't surprise me.

> I don't suppose there would be a way to map the pages directly from a
> dump file and have a sort of execute-in-place semantic between the
> file and new process? I'm statistically unlikely to touch pages right
> away in the process, so some of this work seems unnecessary.

No, not with the current dump file format. The basic problem is VMADump goes out of its way to send less than the whole process image. Therefore the dump file is a bunch of pages that aren't necessarily contiguous. There's a tiny bit of goop (page address) in front of every page in the file too.

> * This was a PITA to make work right. Does bproc do this?

BProc process migration runs freeze and thaw directly connected with a TCP socket.

> We found that merely closing the TCP socket on the transmitting
> end would cause a premature connection reset on the receiving end
> and thus hose the thaw. No variation on lingering or cloexec or
> even sleeping worked reliably. In the end I make the sender wait
> in a loop for the TCP_INFO ioctl to return a TCP state of
> TCP_CLOSE. No way should this be necessary; something is horribly
> amiss in my kernel, I think.

Whoa. That's messed up. I've run into a number of TCP bugs including one that results in a spurious connection reset but nothing that matches that. VMADump certainly should not cause any problems since the only I/O related calls it makes are read and write.

The reset bug I saw looks like this: If two machines are connected (A and B) and A calls shutdown(fd, 1) (i.e. no more sends from A, send TCP FIN), then B will sometimes get connection reset by peer on writes.

- Erik
--
Erik Arjan Hendriks               Printed On 100 Percent Recycled Electrons
er...@he...                        Contents may settle during shipment
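[To make the reset pattern Erik describes concrete, a minimal sketch in plain sockets C; this is not BProc/VMADump code, SHUT_WR is the symbolic name for the 1 passed to shutdown(), and the function names are invented for the example.]

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Machine A: finished sending, half-closes its side of the connection. */
void a_done_sending(int sock)
{
    shutdown(sock, SHUT_WR);        /* sends a FIN; A can still receive */
}

/* Machine B: still writing on the same connection.  The reported bug is
 * that this write sometimes fails with ECONNRESET even though B's
 * sending direction was never shut down. */
void b_keep_writing(int sock, const char *buf, size_t len)
{
    if (write(sock, buf, len) < 0 && errno == ECONNRESET)
        fprintf(stderr, "spurious reset: %s\n", strerror(errno));
}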
From: Grant T. <gt...@sw...> - 2002-06-04 20:58:39

> On Thu, Mar 07, 2002 at 07:20:28PM -0500, Grant Taylor wrote:
>> Anyway, it's a little frentic here now, but after things settle down
>> I'll put together a clean patch for mips.

Well, I haven't gotten to the clean patch point yet (I still have nasty hacks in syscall entry to make it work) but now we're at the performance tuning stage...

I'm trying to make vmadump run faster. Currently it looks like the following things are true:

- Freezing is faster than thawing by around a factor of 2.

- Having the dumping kernel connected directly to the undumping kernel* pipelines the whole process and makes thawing the only wall time spent. So that's a 33% speedup.

- Having four CPUs migrate processes to four other CPUs all in parallel pipelines that; so that's a 4X speedup.

- Removing seemingly unneeded icache flushes for non-executable pages in load_map makes no difference, time-wise, on my platform. This is a little surprising given 250MB of cache flushing that went away, but so be it.

- The network is not the bottleneck. My cluster will TCP within itself at nearly a gigabit, and the dumps go at maybe 10MB/s.

This leaves me wondering why thawing is so much more expensive than freezing. It's curious that writing the arriving dump to a file in tmpfs is way faster than thawing. My assumption is that the extra time is sunk in the allocation of pages through mmap; this is the main difference between the various paths. Is each read of a page's data triggering a page fault, or are the pages allocated at mmap time? Is there something I can do to speed this up?

I don't suppose there would be a way to map the pages directly from a dump file and have a sort of execute-in-place semantic between the file and the new process? I'm statistically unlikely to touch pages right away in the process, so some of this work seems unnecessary.

* This was a PITA to make work right. Does bproc do this? We found that merely closing the TCP socket on the transmitting end would cause a premature connection reset on the receiving end and thus hose the thaw. No variation on lingering or cloexec or even sleeping worked reliably. In the end I make the sender wait in a loop for the TCP_INFO ioctl to return a TCP state of TCP_CLOSE. No way should this be necessary; something is horribly amiss in my kernel, I think.

--
Grant Taylor           http://www.picante.com/~gtaylor/
Starent Networks       http://www.starentnetworks.com/
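[For concreteness, here is a rough sketch of what that sender-side wait loop might look like. It assumes the "TCP_INFO ioctl" refers to the Linux getsockopt(TCP_INFO) interface; the function name and polling interval are invented for the example, and on kernels/glibcs of that era struct tcp_info may need to come from <linux/tcp.h> instead of <netinet/tcp.h>. It only illustrates the workaround being described - as the poster notes, needing it at all suggests something else is wrong.]

#include <netinet/in.h>
#include <netinet/tcp.h>   /* struct tcp_info, TCP_INFO, TCP_CLOSE */
#include <sys/socket.h>
#include <unistd.h>

/* Poll the socket's TCP state until the kernel reports TCP_CLOSE.
 * How each side orders its shutdown()/close() calls is not shown in
 * the original message; this is only the waiting part. */
static void wait_for_tcp_close(int sock)
{
    struct tcp_info info;
    socklen_t len;

    for (;;) {
        len = sizeof(info);
        if (getsockopt(sock, IPPROTO_TCP, TCP_INFO, &info, &len) < 0)
            break;                  /* can't query any more; give up */
        if (info.tcpi_state == TCP_CLOSE)
            break;
        usleep(10000);              /* 10 ms between polls */
    }
}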
From: Daniel W. <wi...@ci...> - 2002-06-01 11:19:52

> What does the environment look like? Are you using a Clustermatic
> or Scyld-like environment? Specifically, I'm wondering if you're
> using the normal NSS stuff for lookups or using beonss/bproc.

Thanks, Erik. We're using the Clubmask environment, plain standard RH7.2 nss stuff. We have two clusters which are operating fine under this environment (24 nodes & 64 nodes), and one which isn't (64 nodes -- and I'm trying to debug via bproc to determine what is different). All Dell 2550 front ends to Dell 1550 nodes.

> As far as strace goes, you can take two routes with bpsh. Both
> require BProc version 3 or later.
[...]
> strace bpsh -N 0 command
> Another alternative is to use the undocumented "-S" flag.

3.1.9, so I'll try your recommendations out, thanks again!

Dan W.

--
-- Daniel Widyono                      http://www.cis.upenn.edu/~widyono
-- Linux Cluster Group, CIS Dept., SEAS, University of Pennsylvania
-- Mail: Rm 556, CIS Dept  200 S 33rd St  Philadelphia, PA 19104
From: Erik A. H. <er...@he...> - 2002-05-31 19:21:24

On Wed, May 29, 2002 at 12:39:53PM -0400, Daniel Widyono wrote:
> Any ideas? How do I go about debugging bpsh calls in general? strace output
> on master doesn't seem promising. strace -f gives:

What does the environment look like? Are you using a Clustermatic or Scyld-like environment? Specifically, I'm wondering if you're using the normal NSS stuff for lookups or using beonss/bproc. beonss is a bit quirky when you ask for user names.

As far as strace goes, you can take two routes with bpsh. Both require BProc version 3 or later.

strace bpsh -N 0 command

This turns off ALL I/O forwarding. This means that you don't get to see output, but bpsh *will* exec the command directly, so strace will be attached to the right process.

Another alternative is to use the undocumented "-S" flag. This stops the child process before doing bproc_execmove. That gives you a window to attach strace to the child before it runs. For example:

$ bpsh 0 -S uptime
<it hangs here>

In another window:

$ ps xf | grep uptime
20172 pts/10   S      0:00 grep uptime
20169 pts/9    S      0:00 bpsh 0 -S uptime
20170 pts/9    T      0:00  \_ bpsh 0 -S uptime
$ strace -p 20170
--- SIGSTOP (Stopped (signal)) ---
SYS_291(0x304, 0, 0x11ffff110, 0x11ffff8d0, 0x11ffff8e0) = 0
SYS_0(0x304, 0, 0x11ffff110, 0, 0x11ffff8e0) = -1 ERRNO_339 (errno 339)
SYS_0(0x11ffff2b8, 0x20000005870, 0, 0, 0x2) = 17
open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
.... watch uptime run to completion ...

- Erik
--
Erik Arjan Hendriks               Printed On 100 Percent Recycled Electrons
er...@he...                        Contents may settle during shipment
From: Erik A. H. <er...@he...> - 2002-05-31 19:04:39

On Wed, May 29, 2002 at 02:58:29PM -0700, Dave Shepherd wrote:
> Hello - I hope I've found the right mailing list for this question.
>
> I've been trying to load the bproc module and I'm just not having any luck.
> Another set of eyes is due.
>
> Steps taken are as follows:
> 1) new Red Hat 7.2 OS load - (also tried 7.3 - does the same thing)
>
> 2) load clustermatic rpms from CD
>    cd /mnt/cdrom/RPMS/i686
>    rpm -ivh *
>    I get some old kernel & glibc depend errors - so have to use
>    (rpm -ivh --nodeps --force $FAILED.rpms)

There's a few conflicts with 7.3 tho nothing major. I thought 7.2 basically worked. Oh well.

> 3) edit /etc/lilo.conf
>    added:
>    image=/boot/vmlinuz-2.4.18.lanl.16smp
>        label=linux
>        initrd=/boot/initrd-2.4.18-lanl.16smp.img
>        read-only
>        root=/dev/hda5
>
> 4) run lilo -v to gen /boot/boot.b
>
> 5) edit /etc/beowulf/config to reflect my interface
>    interface eth0:0 192.168.45.1 255.255.255.0
>    nodes 10
>    iprange 0 192.168.45.20 192.168.45.30
>
> 6) run beoboot -2 -i -o boot -k /boot/vmlinuz-2.4.18.lanl.16 & reboot
>
> 7) /etc/rc3.d/S72beowulf
>    system boots up properly except for the following error message
>    Configuring network interface (eth0:0)   [ OK ]
>    Loading modules: modprobe: Can't locate module bproc   [FAILED]

There's a screw-up in the bproc RPMs. This should be easily fixed by running "depmod -a" after booting the new kernel.

> 8) OK - try manual load
>    # insmod bproc
>    Using /lib/modules/2.4.18-lanl.16smp/bproc/bproc.o
>    /lib/modules/2.4.18-lanl.16smp/bproc/bproc.o: unresolved symbol vmadump_freeze_proc
>    /lib/modules/2.4.18-lanl.16smp/bproc/bproc.o: unresolved symbol vmadump_thaw_proc
>    /lib/modules/2.4.18-lanl.16smp/bproc/bproc.o: unresolved symbol do_vmadump

bproc requires the vmadump module to be loaded. "depmod -a" followed by "modprobe bproc" should do it.

- Erik
--
Erik Arjan Hendriks               Printed On 100 Percent Recycled Electrons
er...@he...                        Contents may settle during shipment
From: Daniel W. <wi...@ci...> - 2002-05-31 13:18:08

Good point, I should have been more thorough in my test case. However:

main() {
    struct passwd *pwent = getpwuid(0);
    printf("pwent = %p\n", (void *)pwent);
    if (pwent) {
        printf("pwent->pw_name = \"%s\"\n", pwent->pw_name);
    }
}

[root@alpha pam_bproc]# ./a.out
pwent = 0x4016008c
pwent->pw_name = "root"
[root@alpha pam_bproc]# bpsh 0 ./a.out
bpsh: Child process exited abnormally.

This happens in getpwuid, not in my app. I also confirmed that this runs fine on the node when not run via bpsh. One more check, not using bpsh:

main() {
    int pid;
    if (pid = bproc_rfork(0)) {
        wait(pid);
    } else if (pid == 0) {
        struct passwd *pwent = getpwuid(0);
        printf("pwent = %p\n", (void *)pwent);
        if (pwent) {
            printf("pwent->pw_name = \"%s\"\n", pwent->pw_name);
        }
    } else {
        perror("could not rfork\n");
        exit(-1);
    }
}

This works fine. Signs seem to point to bpsh interaction. Any other bproc debugging advice?

Thanks,
Dan W.

On Wed, May 29, 2002 at 02:18:55PM -0400, Jag wrote:
> Check if pwent is NULL. Unless you managed to properly set up all the
> nss stuff on the slave node, it won't be able to find the info for uid
> 0, and will thus return NULL, and your printf statement will likely
> cause a segfault.

--
-- Daniel Widyono                      http://www.cis.upenn.edu/~widyono
-- Linux Cluster Group, CIS Dept., SEAS, University of Pennsylvania
-- Mail: Rm 556, CIS Dept  200 S 33rd St  Philadelphia, PA 19104
From: Dave S. <dav...@vi...> - 2002-05-29 21:58:22

Hello - I hope I've found the right mailing list for this question.

I've been trying to load the bproc module and I'm just not having any luck. Another set of eyes is due.

Steps taken are as follows:

1) new Red Hat 7.2 OS load - (also tried 7.3 - does the same thing)

2) load clustermatic rpms from CD
   cd /mnt/cdrom/RPMS/i686
   rpm -ivh *
   I get some old kernel & glibc depend errors - so have to use
   (rpm -ivh --nodeps --force $FAILED.rpms)

3) edit /etc/lilo.conf
   added:
   image=/boot/vmlinuz-2.4.18.lanl.16smp
       label=linux
       initrd=/boot/initrd-2.4.18-lanl.16smp.img
       read-only
       root=/dev/hda5

4) run lilo -v to gen /boot/boot.b

5) edit /etc/beowulf/config to reflect my interface
   interface eth0:0 192.168.45.1 255.255.255.0
   nodes 10
   iprange 0 192.168.45.20 192.168.45.30

6) run beoboot -2 -i -o boot -k /boot/vmlinuz-2.4.18.lanl.16 & reboot

7) /etc/rc3.d/S72beowulf
   system boots up properly except for the following error message
   Configuring network interface (eth0:0)   [ OK ]
   Loading modules: modprobe: Can't locate module bproc   [FAILED]

8) OK - try manual load
   # insmod bproc
   Using /lib/modules/2.4.18-lanl.16smp/bproc/bproc.o
   /lib/modules/2.4.18-lanl.16smp/bproc/bproc.o: unresolved symbol vmadump_freeze_proc
   /lib/modules/2.4.18-lanl.16smp/bproc/bproc.o: unresolved symbol vmadump_thaw_proc
   /lib/modules/2.4.18-lanl.16smp/bproc/bproc.o: unresolved symbol do_vmadump

Other Info:

# uname -a
Linux sandpiper 2.4.18-lanl.16.smp #1 SMP Tue Mar 5 21:11:46 MST 2002 i686 unknown

# rpm -qa | grep kernel
kernel-utils-2.4-7.4
kernel-smp-2.4.18-lanl.16
kernel-2.4.18-lanl.16
kernel-headers-2.4.18-lanl.16
kernel-source-2.4.18-lanl.16
kernel-doc-2.4.18-lanl.16

# rpm -qa | grep bproc
bproc-3.1.9-1
bproc-libs-3.1.9-1
bproc-modules-3.1.9-1.k2.4.18_lanl.16
bproc-modules-smp-3.1.9-1.k2.4.18_lanl.16
bproc-devel-3.1.9-1

# rpm -qa | grep beo
beoboot-lanl.1.2-1
beonss-1.0.12-lanl.2
beoboot-modules-lanl.1.2-12.4.18_lanl.16

That's about it.

Dav...@Vi...
From: Jag <ag...@li...> - 2002-05-29 18:20:38

On Wed, 2002-05-29 at 12:39, Daniel Widyono wrote:
> #include <stdio.h>
> #include <sys/bproc.h>
> #include <pwd.h>
> #include <sys/types.h>
>
> main() {
>     struct passwd *pwent = getpwuid(0);
>     printf("%s\n", pwent->pw_name);
> }
>
> # bpsh 0 ./a.out
> bpsh: Child process exited abnormally.
>
> Same for getgrgid().

Check if pwent is NULL. Unless you managed to properly set up all the nss stuff on the slave node, it won't be able to find the info for uid 0, and will thus return NULL, and your printf statement will likely cause a segfault.