From: <er...@he...> - 2003-09-17 19:05:30
|
On Wed, Sep 17, 2003 at 12:16:27PM -0400, Nicholas Henke wrote: > Hello~ > I am getting a repeatable oops with either 2.4.20 or 2.4.21 on bproc > 3.2.5 or 3.2.6. Attached is the oops trace. The oops is not fatal all of > the time, but left unchecked, it really makes a mess of things. Any > ideas? This traceback is 99% networking calls. The only BProc part is the call to sock_read. Plus, it seems to have died in an IRQ or something like that. This is almost certainly a network driver or network stack bug. Is it repeatable w/ different networking hardware? (Different hardware what uses a different driver.) - Erik > vmadump: 1.69 Erik Hendriks <er...@he...> > bproc: Beowulf Distributed Process Space Version 3.2.6 > bproc: (C) 1999-2002 Erik Hendriks <er...@he...> > do_IRQ: stack overflow: 924 > c024a365 0000039c 00000001 de92d280 de92d280 de92d280 dfba0e00 c02440a4 > de92d280 00000000 de92d280 de92d280 de92d280 dfba0e00 debac180 00000018 > 00000018 ffffff12 c01df575 00000010 00000202 de92d280 fffffff4 c01df5ec > Call Trace: [<c01df575>] skb_release_data [kernel] 0x15 (0xde57ca9c)) > [<c01df5ec>] kfree_skbmem [kernel] 0xc (0xde57cab0)) > [<c01df76e>] __kfree_skb [kernel] 0x11e (0xde57cac0)) > [<c022bcf2>] packet_rcv_spkt [kernel] 0x1b2 (0xde57cacc)) > [<c01df5ec>] kfree_skbmem [kernel] 0xc (0xde57caec)) > [<c01df76e>] __kfree_skb [kernel] 0x11e (0xde57cafc)) > [<c022bcf2>] packet_rcv_spkt [kernel] 0x1b2 (0xde57cb08)) > [<c01e32cf>] dev_queue_xmit_nit [kernel] 0x8f (0xde57cb28)) > [<c01ed560>] qdisc_restart [kernel] 0x60 (0xde57cb48)) > [<e08ff86a>] speedo_start_xmit [eepro100] 0x18a (0xde57cb54)) > [<c01e34ee>] dev_queue_xmit [kernel] 0x14e (0xde57cb70)) > [<c01ed514>] qdisc_restart [kernel] 0x14 (0xde57cb88)) > [<c01e34ee>] dev_queue_xmit [kernel] 0x14e (0xde57cbac)) > [<c01fb7a2>] ip_output [kernel] 0x102 (0xde57cbc4)) > [<c01fbbd0>] ip_queue_xmit [kernel] 0x3c0 (0xde57cbf8)) > [<c01fb7a2>] ip_output [kernel] 0x102 (0xde57cc00)) > [<c01fbbd0>] ip_queue_xmit [kernel] 
0x3c0 (0xde57cc34)) > [<c02111be>] tcp_v4_send_check [kernel] 0x6e (0xde57cc98)) > [<c020bc15>] tcp_transmit_skb [kernel] 0x565 (0xde57ccc0)) > [<c01df40f>] alloc_skb [kernel] 0xef (0xde57cd1c)) > [<c020e191>] tcp_send_ack [kernel] 0xc1 (0xde57cd34)) > [<c01deea3>] sock_def_wakeup [kernel] 0x33 (0xde57cd4c)) > [<c020a85a>] tcp_rcv_synsent_state_process [kernel] 0x30a (0xde57cd58)) > [<c01fef70>] tcp_rfree [kernel] 0x0 (0xde57cd6c)) > [<c01def59>] sock_def_readable [kernel] 0x39 (0xde57cd74)) > [<c01fef70>] tcp_rfree [kernel] 0x0 (0xde57cd84)) > [<c020a0c9>] tcp_rcv_established [kernel] 0x429 (0xde57cd90)) > [<c020ab6e>] tcp_rcv_state_process [kernel] 0xbe (0xde57cde0)) > [<c01def59>] sock_def_readable [kernel] 0x39 (0xde57ce04)) > [<c01fef70>] tcp_rfree [kernel] 0x0 (0xde57ce14)) > [<c020a0c9>] tcp_rcv_established [kernel] 0x429 (0xde57ce20)) > [<c01fb7a2>] ip_output [kernel] 0x102 (0xde57ce5c)) > [<c02120f8>] tcp_v4_do_rcv [kernel] 0x38 (0xde57ce74)) > [<c01fbbd0>] ip_queue_xmit [kernel] 0x3c0 (0xde57ce90)) > [<c021264d>] tcp_v4_rcv [kernel] 0x46d (0xde57cea4)) > [<c01e07d4>] skb_checksum [kernel] 0x54 (0xde57ced8)) > [<c0212191>] tcp_v4_do_rcv [kernel] 0xd1 (0xde57cf04)) > [<c021202f>] tcp_v4_checksum_init [kernel] 0x7f (0xde57cf1c)) > [<c021264d>] tcp_v4_rcv [kernel] 0x46d (0xde57cf34)) > [<c01f88c3>] ip_local_deliver [kernel] 0xf3 (0xde57cf58)) > [<c01f606b>] ip_route_input [kernel] 0x3b (0xde57cf60)) > [<c01f8cb5>] ip_rcv [kernel] 0x355 (0xde57cfa0)) > [<c01df5ec>] kfree_skbmem [kernel] 0xc (0xde57cfd0)) > [<c01f88c3>] ip_local_deliver [kernel] 0xf3 (0xde57cfe8)) > [<c01f606b>] ip_route_input [kernel] 0x3b (0xde57cff0)) > [<c01f8cb5>] ip_rcv [kernel] 0x355 (0xde57d030)) > [<c01e37f0>] netif_rx [kernel] 0xc0 (0xde57d03c)) > [<c01df5ec>] kfree_skbmem [kernel] 0xc (0xde57d060)) > [<c01df76e>] __kfree_skb [kernel] 0x11e (0xde57d070)) > [<c022bcf2>] packet_rcv_spkt [kernel] 0x1b2 (0xde57d07c)) > [<c01ed514>] qdisc_restart [kernel] 0x14 (0xde57d0a0)) > [<c01e3e8f>] 
net_rx_action [kernel] 0x9f (0xde57d0b8)) > [<c01e3c99>] netif_receive_skb [kernel] 0x199 (0xde57d0d8)) > [<c01e3d49>] process_backlog [kernel] 0x79 (0xde57d118)) > [<c010a920>] do_IRQ [kernel] 0x100 (0xde57d134)) > [<c01e3e8f>] net_rx_action [kernel] 0x9f (0xde57d148)) > [<c012137b>] do_softirq [kernel] 0x6b (0xde57d180)) > [<c01e5374>] .text.lock.dev [kernel] 0x8e (0xde57d19c)) > [<c01f606b>] ip_route_input [kernel] 0x3b (0xde57d1bc)) > [<c01fb7a2>] ip_output [kernel] 0x102 (0xde57d1f8)) > [<c01fbbd0>] ip_queue_xmit [kernel] 0x3c0 (0xde57d22c)) > [<c01df76e>] __kfree_skb [kernel] 0x11e (0xde57d24c)) > [<c01e3a22>] net_tx_action [kernel] 0x62 (0xde57d258)) > [<c012137b>] do_softirq [kernel] 0x6b (0xde57d274)) > [<c0117ea0>] do_page_fault [kernel] 0x0 (0xde57d298)) > [<c0108d84>] error_code [kernel] 0x34 (0xde57d2a0)) > [<c02111be>] tcp_v4_send_check [kernel] 0x6e (0xde57d2cc)) > [<c020bc15>] tcp_transmit_skb [kernel] 0x565 (0xde57d2f4)) > [<c01e162c>] skb_copy_datagram_iovec [kernel] 0x4c (0xde57d334)) > [<c01df40f>] alloc_skb [kernel] 0xef (0xde57d350)) > [<c020e191>] tcp_send_ack [kernel] 0xc1 (0xde57d368)) > [<c01df5ec>] kfree_skbmem [kernel] 0xc (0xde57d374)) > [<c0202a75>] tcp_recvmsg [kernel] 0x7e5 (0xde57d38c)) > [<c021eac9>] inet_recvmsg [kernel] 0x39 (0xde57d3d0)) > [<c021eac9>] inet_recvmsg [kernel] 0x39 (0xde57d3f0)) > [<c01dbe91>] sock_recvmsg [kernel] 0x31 (0xde57d41c)) > [<c01dbf98>] sock_read [kernel] 0x88 (0xde57d484)) > [<e0918ab8>] k_read_u_f [bproc] 0x34 (0xde57d4c8)) > [<e0910316>] read_req_file_user [bproc] 0x5e (0xde57d4e8)) > [<e09106d0>] vmadump_read_file [bproc] 0x0 (0xde57d514)) > [<e0907134>] read_user [vmadump] 0x44 (0xde57d518)) > [<e0907eb7>] load_map [vmadump] 0x1ab (0xde57d548)) > [<e090717d>] read_kern [vmadump] 0x2d (0xde57d588)) > [<e0908385>] vmadump_thaw_proc [vmadump] 0x45d (0xde57d5a8)) > [<c01df40f>] alloc_skb [kernel] 0xef (0xde57d5c4)) > [<c01df76e>] __kfree_skb [kernel] 0x11e (0xde57d628)) > [<c0207570>] 
tcp_clean_rtx_queue [kernel] 0x1b0 (0xde57d630)) > [<c0208ad6>] tcp_data_queue [kernel] 0x2b6 (0xde57d674)) > [<c0207a98>] tcp_ack [kernel] 0x138 (0xde57d6a0)) > [<c020b4be>] tcp_rcv_state_process [kernel] 0xa0e (0xde57d6c4)) > [<c020e191>] tcp_send_ack [kernel] 0xc1 (0xde57d6dc)) > [<c01fef70>] tcp_rfree [kernel] 0x0 (0xde57d6f0)) > [<c020a0c9>] tcp_rcv_established [kernel] 0x429 (0xde57d6fc)) > [<c01e07d4>] skb_checksum [kernel] 0x54 (0xde57d7b4)) > [<c0212191>] tcp_v4_do_rcv [kernel] 0xd1 (0xde57d7e0)) > [<c021202f>] tcp_v4_checksum_init [kernel] 0x7f (0xde57d7f8)) > [<c021264d>] tcp_v4_rcv [kernel] 0x46d (0xde57d810)) > [<c01df5ec>] kfree_skbmem [kernel] 0xc (0xde57d8ac)) > [<c01f88c3>] ip_local_deliver [kernel] 0xf3 (0xde57d8c4)) > [<c01f606b>] ip_route_input [kernel] 0x3b (0xde57d8cc)) > [<c011d016>] ll_copy_to_user [kernel] 0x46 (0xde57d8dc)) > [<c011d016>] ll_copy_to_user [kernel] 0x46 (0xde57d8fc)) > [<c011d016>] ll_copy_to_user [kernel] 0x46 (0xde57d90c)) > [<c011d016>] ll_copy_to_user [kernel] 0x46 (0xde57d92c)) > [<c01e0fb8>] memcpy_toiovec [kernel] 0x38 (0xde57d950)) > [<c01e162c>] skb_copy_datagram_iovec [kernel] 0x4c (0xde57d974)) > [<c0201e5e>] cleanup_rbuf [kernel] 0xae (0xde57d994)) > [<c0201e5e>] cleanup_rbuf [kernel] 0xae (0xde57d9b4)) > [<c0202a75>] tcp_recvmsg [kernel] 0x7e5 (0xde57d9cc)) > [<c021eac9>] inet_recvmsg [kernel] 0x39 (0xde57da10)) > [<c021eac9>] inet_recvmsg [kernel] 0x39 (0xde57da30)) > [<c01dbe91>] sock_recvmsg [kernel] 0x31 (0xde57da5c)) > [<c01fefbe>] tcp_poll [kernel] 0x2e (0xde57da7c)) > [<c01dbf98>] sock_read [kernel] 0x88 (0xde57dac4)) > [<e0918ab8>] k_read_u_f [bproc] 0x34 (0xde57db08)) > [<e0910375>] read_req_file_kern [bproc] 0x2d (0xde57db58)) > [<e0911c32>] do_recv [bproc] 0x522 (0xde57db78)) > [<c01193cc>] schedule [kernel] 0x48c (0xde57dbc0)) > [<c0118e27>] schedule_timeout [kernel] 0x17 (0xde57dc14)) > [<c02120f8>] tcp_v4_do_rcv [kernel] 0x38 (0xde57dc34)) > [<c0201e5e>] cleanup_rbuf [kernel] 0xae (0xde57dc64)) > 
[<c0202a75>] tcp_recvmsg [kernel] 0x7e5 (0xde57dc7c)) > [<c021eac9>] inet_recvmsg [kernel] 0x39 (0xde57dce0)) > [<c01dbe91>] sock_recvmsg [kernel] 0x31 (0xde57dd0c)) > [<c021eb15>] inet_sendmsg [kernel] 0x35 (0xde57dd2c)) > [<c01dbe3c>] sock_sendmsg [kernel] 0x6c (0xde57dd40)) > [<c01dbf98>] sock_read [kernel] 0x88 (0xde57dd74)) > [<e09106d0>] vmadump_read_file [bproc] 0x0 (0xde57dda4)) > [<e09106b4>] vmadump_write_file [bproc] 0x0 (0xde57dda8)) > [<e0918ab8>] k_read_u_f [bproc] 0x34 (0xde57ddb8)) > [<c01261c3>] collect_signal [kernel] 0x93 (0xde57de0c)) > [<e0912236>] recv_process [bproc] 0x6a (0xde57de58)) > [<e0918be1>] k_close [bproc] 0xd (0xde57df38)) > [<e0917b70>] do_recv_proc_stub [bproc] 0x184 (0xde57df58)) > [<e090c319>] bproc_kernel_thread [bproc] 0x2d (0xde57dfb8)) > > bproc: connect: connect to 192.168.2.4:46219 failed; errno=111 |
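Erik's reading of the trace (almost all networking frames, with BProc appearing only around the sock_read calls) can be checked mechanically. The following is a minimal sketch, assuming the oops text is fed on stdin in the format quoted above, where each frame carries a bracketed module tag such as [kernel] or [bproc]. The function name count_frames_by_module is hypothetical and is not part of BProc or any kernel tool.

```shell
#!/bin/sh
# Hypothetical helper: count call-trace frames per module tag.
# Reads oops text on stdin; prints "<count> [module]" lines, largest first.
count_frames_by_module() {
    # Put every whitespace-separated token on its own line, then keep only
    # tokens that look like a module tag, e.g. [kernel], [bproc], [eepro100].
    # Address tokens such as [<c01df575>] contain '<' and are skipped.
    tr ' ' '\n' |
    awk '/^\[[a-z0-9_]+\]$/ { n[$0]++ }
         END { for (m in n) print n[m], m }' |
    sort -k1,1nr -k2,2
}
```

Saving the trace to a file and running `count_frames_by_module < oops.txt` (the file name is only an example) gives a quick per-module tally. It counts frames only and says nothing about how much stack each frame consumes.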
From: Nicholas H. <he...@se...> - 2003-09-17 18:13:46
|
On Wed, 2003-09-17 at 12:16, Nicholas Henke wrote:
> Hello~
> I am getting a repeatable oops with either 2.4.20 or 2.4.21 on bproc
> 3.2.5 or 3.2.6. Attached is the oops trace. The oops is not fatal all of
> the time, but left unchecked, it really makes a mess of things. Any
> ideas?

Just for fun -- where in bproc/vmadump would the kernel stack be used by bproc stuffs? Is there a way to reduce that?

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
From: Nicholas H. <he...@se...> - 2003-09-17 16:16:43
|
Hello~

I am getting a repeatable oops with either 2.4.20 or 2.4.21 on bproc 3.2.5 or 3.2.6. Attached is the oops trace. The oops is not fatal all of the time, but left unchecked, it really makes a mess of things. Any ideas?

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
From: Jeremy A. <arc...@co...> - 2003-09-17 03:17:16
|
I understand the advantages of kernel modules, but my question really is: "Do the network devices HAVE to be compiled as modules?"

For right now I have two computers with identical configurations. One is running the kernel that I compiled and has bproc running. The second computer is giving problems with the IRQ trap. To decrease the size of the kernel, I removed sound, usb, power management, and some other devices. Since the kernel is running on the first computer, I assume (my last words) that it will run on the second computer.

If anyone has successfully got bproc/beoboot working where the network devices are compiled into the kernel, I would be grateful if you just said so, so that I know that the problem lies elsewhere.

Nonetheless, tomorrow I will be recompiling the kernel (again), but this time with the devices built as modules.

Thanks for all your help thus far. The eventual cluster will have 240 cpus in a single rack, and the single system image and netbooting will be very (understatement) handy. If the blades support PXE, I will try that too.

Thanks again.
-J

p.s. If you work at LANL, send me an email, because I am sure I could use some pointers on this.

On Tuesday, Sep 16, 2003, at 17:15 US/Pacific, Joshua Aune wrote:
> On Wed, Sep 17, 2003 at 12:35:53AM +0100, jeremy archuleta wrote:
>> The question: Do network devices need to be compiled as modules?
>
> For phase1 it is typically a good idea. This allows the
> modules to be unloaded and some hardware shutdown to happen before the
> new kernel takes over.
>
> For phase2 it helps because it keeps the kernel size small. In the
> past I have run into a bug if the phase2 kernel image is too big.
>
>> almost there...
>
> It's well worth the pain after you get there. I have found that a bproc
> cluster is so much easier to build and maintain than a traditional
> cluster.
>
> Josh
>
> --
> Joshua Aune
> http://www.linuxnetworx.com
From: Joshua A. <lu...@li...> - 2003-09-17 00:15:17
|
On Wed, Sep 17, 2003 at 12:35:53AM +0100, jeremy archuleta wrote:
> The question: Do network devices need to be compiled as modules?

For phase1 it is typically a good idea. This allows the modules to be unloaded and some hardware shutdown to happen before the new kernel takes over.

For phase2 it helps because it keeps the kernel size small. In the past I have run into a bug if the phase2 kernel image is too big.

> almost there...

It's well worth the pain after you get there. I have found that a bproc cluster is so much easier to build and maintain than a traditional cluster.

Josh

--
Joshua Aune
http://www.linuxnetworx.com
From: jeremy a. <arc...@co...> - 2003-09-16 23:35:56
|
The question: Do network devices need to be compiled as modules?

I have them compiled into the kernel, hence almost using nothing from /etc/beowulf/config.boot except for the long pci list. Does anyone else have their network devices compiled into the kernel rather than as modules?

The reason I ask is because I have now gotten to here:

    rebooting in 2 secs

but then....

    Unexpected IRQ trap at vector 11
    Unexpected IRQ trap at vector 11
    Unexpected IRQ trap at vector 11
    Unexpected IRQ trap at vector 11
    ...

lspci shows:

    00:11.0 Ethernet controller: 3Com Corporation 3c905

almost there...
-j
From: Joshua A. <lu...@ln...> - 2003-09-16 17:54:20
|
On Tue, Sep 16, 2003 at 06:38:50PM +0100, jeremy archuleta wrote:
> > Great. Thanks for the help.
> > The bootfile line in /etc/beowulf/config was the trick.
> >
> > Now another error ; )
> >
> > finished retransmitting
> > monte: kernel setup : 2650 bytes at 0x90000)
> > monte: kernel code : 1006138 bytes at 0x100000
> > couldn't find symbol real_mode_conf
> >
> > any thoughts? bad kernel compile?
> > will try again from scratch....

Like you discovered, probably missing the 2k monte stuff.

> I found the following in the release notes:
>
> and I have found a patch for kernel 2.4.17
> monte/linux-2.4.17-save_real_mode_conf.patch
>
> does this patch apply to 2.4.17 and newer, and thus be used for 2.4.21 (my
> kernel)? If so, I assume
> "patch -p0 < beoboot-cm.1.5/monte/linux-2.4.17-save_real_mode_conf.patch" will work.

I have one from 2.4.20 that I am currently using with 2.4.21 and higher kernels. Attached.

Also remember to select the save real mode option in your phase1 kernel .config file after patching.

Josh
From: jeremy a. <arc...@co...> - 2003-09-16 17:38:55
|
Quoting jeremy archuleta <arc...@co...>:
> > try adding the line
> >
> > bootfile /var/beowulf/boot.img
> >
> > to /etc/beowulf/config and restart beoserv.
> >
> > - Erik
>
> Great. Thanks for the help.
> The bootfile line in /etc/beowulf/config was the trick.
>
> Now another error ; )
>
> finished retransmitting
> monte: kernel setup : 2650 bytes at 0x90000)
> monte: kernel code : 1006138 bytes at 0x100000
> couldn't find symbol real_mode_conf
>
> any thoughts? bad kernel compile?
> will try again from scratch....
>
> -J

I found the following in the release notes:

beoboot-lanl 1.2
-----------------------------------------------------
This version should be used with BProc version 3.1.6+

There are some monte MONTE_PROTECTED related cleanups which require that you patch the phase 1 beoboot kernel. This is necessary because the kernel normally throws away the information from the real mode code after reading it. It used to be possible to just find it at 90000h but boot loaders have begun putting that information at other addresses.

and I have found a patch for kernel 2.4.17

monte/linux-2.4.17-save_real_mode_conf.patch

does this patch apply to 2.4.17 and newer, and thus be used for 2.4.21 (my kernel)? If so, I assume "patch -p0 < beoboot-cm.1.5/monte/linux-2.4.17-save_real_mode_conf.patch" will work.

-J
From: jeremy a. <arc...@co...> - 2003-09-16 17:06:01
|
> try adding the line
>
> bootfile /var/beowulf/boot.img
>
> to /etc/beowulf/config and restart beoserv.
>
> - Erik

Great. Thanks for the help.
The bootfile line in /etc/beowulf/config was the trick.

Now another error ; )

    finished retransmitting
    monte: kernel setup : 2650 bytes at 0x90000)
    monte: kernel code : 1006138 bytes at 0x100000
    couldn't find symbol real_mode_conf

any thoughts? bad kernel compile?
will try again from scratch....

-J
From: <er...@he...> - 2003-09-16 16:19:25
|
On Mon, Sep 15, 2003 at 07:35:37PM -0700, Jeremy Archuleta wrote:
> On Monday, Sep 15, 2003, at 17:34 US/Pacific, Joshua Aune wrote:
>
> > On Tue, Sep 16, 2003 at 12:42:34AM +0100, jeremy archuleta wrote:
> >>
> >> The error from the slave:
> >> RARP: BPROC 2223; File 4711; file:/var/beowulf/boot.img
> >> recv: <someline about backoff was here>
> >> recv: resend listing on port 1024
> >> recv: requesting /var/beowulf/boot.img from 192.168.36.110:4711
> >> recv: response from server: /var/beowulf/boot.img : File unavailable
> >> Boot image download failure
> >
> > The one that usually bites me when I get this error is either the file
> > (in this case /var/beowulf/boot.img) doesn't exist (or can't be read).
> >
> >> Here are the commands to create the phase 1 and phase 2 images:
> >> sudo beoboot -1 -f -k /boot/vmlinuz-2.4.21 -o /dev/fd0
> >> sudo beoboot -2 -n -k /boot/vmlinuz-2.4.21 -o /var/beowulf/boot.img
> >
> > Try an ls -l /var/beowulf/boot.img to make sure it is there and
> > readable by the beoserv process.
>
> I know that exists. I am curious to know if I also need to set up a
> tftp server. In case I do, I have set one up with boot.img in both
> /tftpboot/ and /tftpboot/var/beowulf/ but this didn't help either.
> Tomorrow I will try again...

try adding the line

    bootfile /var/beowulf/boot.img

to /etc/beowulf/config and restart beoserv.

- Erik
From: Miguel D. C. <mc...@fc...> - 2003-09-16 11:05:11
|
Hello all!

This message was sent to the openmosix mailing list:

Date: Sun, 14 Sep 2003 19:04:03 +0300
From: csaa <cs...@un...>
To: ope...@li...
Subject: [openMosix-general] checkpoint/restart utility

Hi all!

There have been several questions about checkpoint/restart of processes under openMosix. We have written utility that can transparently checkpoint/restart of processes. It works with openMosix kernel and can be found at http://freshmeat.net/projects/chpox/ or at http://www.cluster.kiev.ua/tasks/chpx_eng.html. It is used on the cluster of Information&Computing Center of Kyiv Taras Shevchenko University [http://www.cluster.kiev.ua] and probably will be useful for someone else.

Regards
Olexandr Sudakov, Eugeniy Meshcheryakov

According to their website, this is based on VMADUMP. Would it work in a bproc cluster? Is anyone using other checkpointing tools in bproc clusters?

Cheers,
Miguel

--
Miguel Dias Costa <mc...@fc...>
Centro de Física do Porto
From: Jeremy A. <arc...@co...> - 2003-09-16 01:37:33
|
On Monday, Sep 15, 2003, at 17:34 US/Pacific, Joshua Aune wrote:
> On Tue, Sep 16, 2003 at 12:42:34AM +0100, jeremy archuleta wrote:
>>
>> The error from the slave:
>> RARP: BPROC 2223; File 4711; file:/var/beowulf/boot.img
>> recv: <someline about backoff was here>
>> recv: resend listing on port 1024
>> recv: requesting /var/beowulf/boot.img from 192.168.36.110:4711
>> recv: response from server: /var/beowulf/boot.img : File unavailable
>> Boot image download failure
>
> The one that usually bites me when I get this error is either the file
> (in this case /var/beowulf/boot.img) doesn't exist (or can't be read).
>
>> Here are the commands to create the phase 1 and phase 2 images:
>> sudo beoboot -1 -f -k /boot/vmlinuz-2.4.21 -o /dev/fd0
>> sudo beoboot -2 -n -k /boot/vmlinuz-2.4.21 -o /var/beowulf/boot.img
>
> Try an ls -l /var/beowulf/boot.img to make sure it is there and
> readable by the beoserv process.

I know that exists. I am curious to know if I also need to set up a tftp server. In case I do, I have set one up with boot.img in both /tftpboot/ and /tftpboot/var/beowulf/ but this didn't help either. Tomorrow I will try again...

Thanks.
-J
From: Joshua A. <lu...@ln...> - 2003-09-16 00:36:00
|
On Tue, Sep 16, 2003 at 12:42:34AM +0100, jeremy archuleta wrote:
>
> The error from the slave:
> RARP: BPROC 2223; File 4711; file:/var/beowulf/boot.img
> recv: <someline about backoff was here>
> recv: resend listing on port 1024
> recv: requesting /var/beowulf/boot.img from 192.168.36.110:4711
> recv: response from server: /var/beowulf/boot.img : File unavailable
> Boot image download failure

The one that usually bites me when I get this error is either the file (in this case /var/beowulf/boot.img) doesn't exist (or can't be read).

> Here are the commands to create the phase 1 and phase 2 images:
> sudo beoboot -1 -f -k /boot/vmlinuz-2.4.21 -o /dev/fd0
> sudo beoboot -2 -n -k /boot/vmlinuz-2.4.21 -o /var/beowulf/boot.img

Try an ls -l /var/beowulf/boot.img to make sure it is there and readable by the beoserv process.

Also, just

    beoboot -2 -n -k /boot/vmlinuz-2.4.21

should suffice assuming that the default config files specify the bootfile. ie:

    bootfile /var/beowulf/phase2.img

Josh
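The two checks suggested in this thread (is there a bootfile line in the config, and does the file it names exist and can it be read) can be combined into one small script. This is a minimal sketch, assuming the config uses the "bootfile <path>" line format shown above. The function names are hypothetical, and the readability test applies to whichever user runs the script, which may not be the user beoserv runs as.

```shell
#!/bin/sh
# Hypothetical helper: print the path from the first "bootfile" line
# of the config file given as $1.
bootfile_from_config() {
    awk '$1 == "bootfile" { print $2; exit }' "$1"
}

# Hypothetical helper: report whether the configured boot image exists
# and is readable by the current user. Returns non-zero on any problem.
check_bootfile() {
    cfg=$1
    img=$(bootfile_from_config "$cfg")
    if [ -z "$img" ]; then
        echo "no bootfile line in $cfg"
        return 1
    fi
    if [ ! -r "$img" ]; then
        echo "bootfile $img is missing or unreadable"
        return 1
    fi
    echo "bootfile $img looks readable"
}
```

A typical call would be `check_bootfile /etc/beowulf/config`. A clean result only rules out the two simplest causes of the "File unavailable" response.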
From: jeremy a. <arc...@co...> - 2003-09-15 23:43:09
|
So I have been able to compile and install bproc-3.2.6, cmtools-1.2, and beoboot-cm.1.5 with kernel 2.4.21. But now I can't get beoboot to get the phase 2 image.

The error from the slave:

    RARP: BPROC 2223; File 4711; file:/var/beowulf/boot.img
    recv: <someline about backoff was here>
    recv: resend listing on port 1024
    recv: requesting /var/beowulf/boot.img from 192.168.36.110:4711
    recv: response from server: /var/beowulf/boot.img : File unavailable
    Boot image download failure

The error from the master:

    > sudo beoserv -v
    eth0: data socket 192.168.36.110:1847
    beoserv: node_up listening at: /tmp/.node_up
    beoserv: RARP: 00:C0:4F:6B:8C:7E == 192.168.36.50
    192.168.36.50:1024 request filename=/var/beowulf/boot.img depth=0 resend_port=1024 fail=0.0.0.0:0
    192.168.36.50:1024 response: status=100 addr=0.0.0.0:0 depth=0

Here are the commands to create the phase 1 and phase 2 images:

    sudo beoboot -1 -f -k /boot/vmlinuz-2.4.21 -o /dev/fd0
    sudo beoboot -2 -n -k /boot/vmlinuz-2.4.21 -o /var/beowulf/boot.img

I have not modified any of the /etc/beowulf/ files, nor am I exporting any NFS directories from the master, because I can't tell if I need to or not.

Any help would be appreciated.
-Jeremy
From: jeremy a. <arc...@co...> - 2003-09-12 21:37:25
|
Can anyone send me the beoboot documentation in either DVI or PS format? I am getting errors on the texi2dvi translation (every underscore is an error and there are 4 "Overfull \hbox" errors) Thanks. -jeremy |
From: Wally E. <Wal...@ut...> - 2003-09-12 13:51:25
|
Joshua Aune wrote:
>Check /proc/sys/net/core/[rmem*,wmem*] and friends on the nodes. If I
>remember correctly the init script for nfs in rh8 fiddles with these for
>the better. Also see http://tldp.org/HOWTO/NFS-HOWTO/performance.html
>for some ideas and recommended values for these settings.

Thanks for the tip. I didn't know that those files were there. That wasn't the problem, however. Those files were the same on both systems.

To fix my NFS slowness, I added bg and soft options to the clustermatic mounts in fstab:

    bg,soft,nolock,intr,rsize=8192,wsize=8192 0 0

That got my speed to about 8 seconds for a 250MB transfer, which is fine for now. I will be using the tips from you and Michael Madore to tweak it some more. Thanks for your help.

At 12:52 PM 9/11/2003 -0600, Joshua Aune wrote:
>Doing this on one system took me from 6MB/s to 70MB/s with a single
>server and single client communicating over gigE :). This required tuning
>on both the client and server settings.
>
>The system that hit 70MB/s on an iozone throughput test had the
>following options: rw,rsize=8182,wsize=8192,noac
>
> > And a sample from /etc/fstab on a non-clustermatic node with the same
> > hardware:
> > nfsserver:/exports/home /home nfs bg,soft,intr,rsize=8192,wsize=8192 0 0
>
>Hope this helps,
>Josh
From: Florent C. <Flo...@un...> - 2003-09-12 11:28:44
|
Florent Calvayrac wrote:
> Nicholas Henke wrote:
>
>> On Fri, 2003-09-05 at 05:53, Florent Calvayrac wrote:
>>
>>> Hi
>>>
>>> I am in the process of joining an experimental cluster (8 nodes)
>>> to a larger one (40+ nodes) using Clustermatic; everything is
>>> working fine except for the last 4 nodes : I get an
>>>
>>> bpmaster: Connect from unrecognized node 192.168.32.55
>>>
>>> and so on, although the nodes got an IP assigned with RARP and
>>> reboot at the very end of stage 2...
>>>
>>> looking at the source code I wonder if this comes
>>> from a socket not accepted in "master". How do I increase
>>> the number of available sockets ? Which configuration file
>>> might I have written badly ?
>>
>> It sounds like you need to add your new nodes in /etc/beowulf/config --
>> or the config file you use for bpmaster.
>>
>> Nic
>
> thanks again ; but yes, I had increased the number of nodes,
> and I did a kill -HUP <beoserv> without it complaining.

I eventually found the problem : with a kill -HUP, beoserv just rereads the list of MAC addresses and does not allocate more nodes than when it was first launched. I had to restart beoserv, thus killing the jobs being processed on the older nodes.

This should maybe be fixed or mentioned in the documentation ; or at least, one could initially declare more nodes than are present in the cluster...

--
Florent Calvayrac                          | Tel : 02 43 83 26 26
Laboratoire de Physique de l'Etat Condense | Fax : 02 43 83 35 18
UMR-CNRS 6087                              | http://www.univ-lemans.fr/~fcalvay
Universite du Maine-Faculte des Sciences   | 72085 Le Mans Cedex 9
From: steven j. <py...@li...> - 2003-09-11 22:18:12
|
Greetings,

The best bet will be to install the debian rpm package, and use that to install the SRPM. From there, you can grab the tarball out of SOURCES and build and install it. It should work just fine then.

G'day,
sjames

On Thu, 11 Sep 2003, jeremy archuleta wrote:
> I am trying to install/compile beoboot-cm.1.5 for Debian using kernel 2.4.21. I
> have successfully compiled the kernel and bproc-3.2.6 but now need the remote
> booting portion.
>
> My problem seems to be that Debian doesn't have libmodutils, libmodutilobj, or
> libmodutilutil and hence "make" fails at:
>
> ld -r -o init.o boot.o rarp.o recv.o monte/libmonte.a -Bstatic -L/usr/lib
> -lmodutils -lmodutilobj -lmodutilutil -lz -lcmconf
> ld: cannot find -lmodutils
>
> I have found that the modutils-devel RPM exists (and contains the libraries),
> but I don't know if I can use this on Debian. Is there another place I can
> get/create the libraries? or is there another remote boot option that Debian
> users are using for bproc?
>
> Thanks.
> -Jeremy

--
-------------------------steven james, director of research, linux labs
... ........ ..... .... 230 peachtree st nw ste 2701
the original linux labs atlanta.ga.us 30303
-since 1995 http://www.linuxlabs.com
office 404.577.7747 fax 404.577.7743
-----------------------------------------------------------------------
From: jeremy a. <arc...@co...> - 2003-09-11 21:33:20
|
I am trying to install/compile beoboot-cm.1.5 for Debian using kernel 2.4.21. I have successfully compiled the kernel and bproc-3.2.6 but now need the remote booting portion.

My problem seems to be that Debian doesn't have libmodutils, libmodutilobj, or libmodutilutil and hence "make" fails at:

    ld -r -o init.o boot.o rarp.o recv.o monte/libmonte.a -Bstatic -L/usr/lib -lmodutils -lmodutilobj -lmodutilutil -lz -lcmconf
    ld: cannot find -lmodutils

I have found that the modutils-devel RPM exists (and contains the libraries), but I don't know if I can use this on Debian. Is there another place I can get/create the libraries? or is there another remote boot option that Debian users are using for bproc?

Thanks.
-Jeremy
From: Joshua A. <lu...@li...> - 2003-09-11 18:53:24
|
On Thu, Sep 11, 2003 at 11:03:04AM -0400, Wally Edmondson wrote:
> Here is a sample from my /etc/beowulf/fstab file for the cluster:
> 172.16.129.1:/exports/home /home nfs
> rw,nolock,rsize=8182,wsize=8192,noac 0 0

Check /proc/sys/net/core/[rmem*,wmem*] and friends on the nodes. If I remember correctly the init script for nfs in rh8 fiddles with these for the better. Also see http://tldp.org/HOWTO/NFS-HOWTO/performance.html for some ideas and recommended values for these settings. If you haven't used

Doing this on one system took me from 6MB/s to 70MB/s with a single server and single client communicating over gigE :). This required tuning on both the client and server settings.

The system that hit 70MB/s on an iozone throughput test had the following options: rw,rsize=8182,wsize=8192,noac

> And a sample from /etc/fstab on a non-clustermatic node with the same
> hardware:
> nfsserver:/exports/home /home nfs bg,soft,intr,rsize=8192,wsize=8192 0 0

Hope this helps,
Josh
From: Wally E. <Wal...@ut...> - 2003-09-11 15:03:19
|
I am experiencing bad NFS performance on my cluster nodes. Is this common? Is there a cure?

I have some diskless clustermatic nodes and some diskless nodes just PXE booted with Red Hat 8. The hardware is exactly the same. My non-clustermatic nodes transfer a 350 MB file from an NFS server to its /tmp directory in 12.5 seconds. The clustermatic nodes take about 31 seconds to transfer the same file from the same server. I am guessing that it has something to do with the options in my fstab file, but I don't know much about the noac and nolock options. Anyone else having this problem?

Here is a sample from my /etc/beowulf/fstab file for the cluster:

    172.16.129.1:/exports/home /home nfs rw,nolock,rsize=8182,wsize=8192,noac 0 0

And a sample from /etc/fstab on a non-clustermatic node with the same hardware:

    nfsserver:/exports/home /home nfs bg,soft,intr,rsize=8192,wsize=8192 0 0

Thanks in advance for any help you can offer.

Wally
Wal...@ut...
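When comparing option strings like the two fstab lines above by eye, small differences are easy to miss. For example, the cluster line has rsize=8182 next to wsize=8192. The following is a minimal sketch for pulling individual values out of a comma-separated option string and flagging an rsize/wsize mismatch. The function names are hypothetical. Differing read and write sizes are legal, and nothing in this thread establishes that the 8182 value is a typo or that it caused the slowdown.

```shell
#!/bin/sh
# Hypothetical helper: print the value of option $2 from the
# comma-separated mount option string $1 (empty if not present).
mount_opt() {
    printf '%s\n' "$1" | tr ',' '\n' | sed -n "s/^$2=//p"
}

# Hypothetical helper: compare rsize and wsize in an option string.
check_nfs_sizes() {
    opts=$1
    r=$(mount_opt "$opts" rsize)
    w=$(mount_opt "$opts" wsize)
    if [ -z "$r" ] || [ -z "$w" ]; then
        echo "rsize or wsize not set"
    elif [ "$r" != "$w" ]; then
        echo "rsize=$r differs from wsize=$w"
    else
        echo "ok"
    fi
}
```

Calling `check_nfs_sizes 'rw,nolock,rsize=8182,wsize=8192,noac'` reports the mismatch. The check is purely textual and does not inspect what the NFS client actually negotiated.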
From: Michal J. <mi...@ha...> - 2003-09-07 05:33:26
|
On Sat, Sep 06, 2003 at 08:51:32PM -0600, Michal Jaegermann wrote:
>
> So how I am supposed to start that daemon on a node? Anybody with
> ideas?

I will allow myself to respond to my own posting. It looks like I found a suitable hack. With /etc/beowulf/node_up as follows:

    #!/bin/sh
    /usr/lib/beoboot/bin/node_up $* || exit 1
    echo "bpsh $1 /usr/sbin/gmond" | at now 2> /dev/null

this does what I wanted. Clearly this may be used to run much more extensive sets of commands. The trick seems to be that 'at' executes everything in a different context than a direct startup for a node. Still, if somebody has better ideas I am all ears.

Interestingly enough 'ps uwwaxf' has the following to show on a master node:

    root 3897 0.0 0.0 0 0 ? RWN 23:15 0:00 [gmond]
    root 3898 0.0 0.0 0 0 ? RWN 23:15 0:00  \_ [gmond]
    root 3899 0.0 0.0 0 0 ? RWN 23:15 0:00  \_ [gmond]
    root 3900 0.0 0.0 0 0 ? RWN 23:15 0:00  \_ [gmond]
    root 3901 0.0 0.0 0 0 ? RWN 23:15 0:00  \_ [gmond]
    root 3902 0.0 0.0 0 0 ? RWN 23:15 0:00  \_ [gmond]
    root 3903 0.0 0.0 0 0 ? SWN 23:15 0:00  \_ [gmond]
    root 3904 0.0 0.0 0 0 ? SWN 23:15 0:00  \_ [gmond]

although 'bpsh 0 ps uwwaxf' comes only with this:

    root 3897 0.0 0.1 15816 1384 ? SN 23:15 0:00 /usr/sbin/gmond

Michal
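The 'at now' trick above generalizes to a small helper that queues any command for a node, so that it runs outside the node_up context. This is a minimal sketch built only from the bpsh and at invocations shown in the post. The function name queue_on_node and the DRY_RUN variable are hypothetical additions for illustration and are not part of beoboot or BProc.

```shell
#!/bin/sh
# Hypothetical helper: queue "bpsh NODE COMMAND..." through at(1) so the
# command runs detached from the node_up context.
# If DRY_RUN is set to a non-empty value, the job line is printed
# instead of being submitted.
queue_on_node() {
    node=$1
    shift
    job="bpsh $node $*"
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$job"
    else
        printf '%s\n' "$job" | at now 2> /dev/null
    fi
}
```

Inside a node_up script this would be called after the stock node_up has succeeded, for example `queue_on_node "$1" /usr/sbin/gmond`. Setting DRY_RUN=1 first prints the job line without queuing anything, which helps when testing on a machine that has no nodes attached.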
From: Michal J. <mi...@ha...> - 2003-09-07 02:51:49
|
I have a test cluster (one master and one node) where both machines are running 2.4.19-lanl.22smp from the "Clustermatic 3" CD while booting is done via beoboot-cm.1.5. I am trying to get ganglia running in this setup. I got all components compiled and installed and after fiddling with the configuration a bit they even work; but there is a catch.

On nodes I need to run 'gmond' and a natural thing would be to start it in /etc/beowulf/node_up. By default (when /etc/gmond.conf is absent, for example) it runs as user "nobody". Every attempt to do that ends with "user 'nobody' does not exist". One can put a suitable /etc/gmond.conf on nodes with the help of a line like

plugin miscfiles /etc/beowulf/node/gmond.conf>/etc/gmond.conf

There we have two options. One is 'setuid root'. This brings "user 'root' does not exist" and gmond does not start in 'node_up', although executing it later by typing the commands by hand works fine. Also, 'node_up' does not really return and the node status ends up as "error".

Another possibility is 'no_setuid on' in a node configuration file. This works to an extent. A 'node_up' script which looks like this:

/usr/lib/beoboot/bin/node_up $* || exit 1
echo "running bpsh $1 gmond"
sleep 2
bpsh $1 /usr/sbin/gmond
bpsh $1 ps uwwaxf
sleep 2
bpsh $1 ps uwwaxf

prints the following in a log file for a node:

.....
nodeup : Node setup returned status 0
running bpsh 0 gmond
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
(null) 2014 0.0 0.0 14456 1004 ? S 19:59 0:00 /usr/sbin/gmond
(null) 2016 0.0 0.1 14564 1116 ? S 19:59 0:00 /usr/sbin/gmond
(null) 2019 0.0 0.1 14584 1136 ? S 19:59 0:00  \_ /usr/sbin/gmond
(null) 2020 0.0 0.1 14600 1160 ? R 19:59 0:00  \_ /usr/sbin/gmond
(null) 2021 0.0 0.1 14600 1180 ? S 19:59 0:00  \_ /usr/sbin/gmond
(null) 2022 0.0 0.1 14604 1184 ? S 19:59 0:00  \_ /usr/sbin/gmond
(null) 2023 0.0 0.1 14604 1188 ? S 19:59 0:00  \_ /usr/sbin/gmond
(null) 2024 0.0 0.1 14604 1188 ? S 19:59 0:00  \_ /usr/sbin/gmond
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
(null) 2014 0.0 0.1 15820 1384 ? S 19:59 0:00 /usr/sbin/gmond

but if you check right after 'node_up' has returned then gmond is gone. So how am I supposed to start that daemon on a node? Anybody with ideas?

Attempts through a subshell in a shell wrapper, on the off-chance that we are not detaching properly from a controlling terminal, fail with a bang.

I tried that also with portmap. A difference is that ps prints "#1" instead of "(null)" in the USER field. Good enough to do some NFS mounts with locking from 'node_up', but later portmap is also gone. It appears I am lucky that this one does not seem to care which user id it runs under.

In this particular case I guess that it is possible to work around the issue with a cron job which repeatedly starts gmond on every node which is 'up' on the 'bpstat' list. This is not particularly nice, I am afraid.

BTW - with beoboot-cm.1.4 I had great trouble running almost anything at all from 'node_up'. 1.5 is progress in that respect.

Michal |
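The cron workaround mentioned at the end of the message could look roughly like the loop below. It is a sketch under stated assumptions: it expects one node number per line on standard input and deliberately leaves open how that list is extracted from 'bpstat', since that depends on the bpstat output format. The function name `start_on_nodes` and the `BPSH` override are hypothetical, introduced only so the loop can be dry-run with `echo`.

```sh
#!/bin/sh
# Sketch of the cron fallback: run a command on every node that is up.
# Input: one node number per line on stdin (producing that list from
# bpstat is left to the reader and is not shown here).
# BPSH is a hypothetical override; set BPSH=echo for a dry run.
BPSH=${BPSH:-bpsh}

# start_on_nodes COMMAND...: run COMMAND on each node read from stdin.
start_on_nodes() {
    while read -r node; do
        # skip blank lines in the node list
        [ -n "$node" ] || continue
        # give the remote command /dev/null as stdin so it cannot
        # swallow the rest of the node list
        $BPSH "$node" "$@" < /dev/null
    done
}

# Intended use from a cron job, with the node list supplied on stdin:
#   start_on_nodes /usr/sbin/gmond < /path/to/list-of-up-nodes
```

The sketch does not check whether gmond is already running on a node, so whether repeated starts are harmless would need to be verified before putting this in a crontab.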
From: J.A. M. <jam...@ab...> - 2003-09-05 22:07:06
|
On 09.05, Nicholas Henke wrote:
>
> J.A. -- I have seen your patches in googles here and there, but cannot
> seem to find a valid page. Do you have a 'home page' for your patches ?
> I would love to play with them a bit.
>

The eee box borked on a power outage. It's up again.

http://giga.cps.unizar.es/~magallon/linux/kernel/2.4.22-jam1m.tar.gz
http://giga.cps.unizar.es/~magallon/linux/kernel/2.4.23-pre2-jam1m.tar.gz

(The m in -jamXm is for mainline. If I get bproc working with -aa again, I will release -jamX again...)

I will probably skip pre3, because I think -pre4 will include the -aa VM, and I have some more fixes in the queue.

> > In short, a RH kernel has nothing to do with a standard kernel, so
> > expect a ton of trouble.
>
> With RH or RH kernels ? :)
>

With RH kernels, when you try to apply on top of them any patch designed for a standard kernel ;)

--
J.A. Magallon <jam...@ab...>      \ Software is like sex:
werewolf.able.es                    \ It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.23-pre2-jam1m (gcc 3.3.1 (Mandrake Linux 9.2 3.3.1-1mdk)) |
From: Florent C. <Flo...@un...> - 2003-09-05 16:23:52
|
Nicholas Henke wrote:
> On Fri, 2003-09-05 at 05:53, Florent Calvayrac wrote:
>
>>Hi
>>
>>I am in the process of joining an experimental cluster (8 nodes)
>>to a larger one (40+ nodes) using Clustermatic; everything is working
>>fine except for the last 4 nodes : I get an
>>
>> bpmaster: Connect from unrecognized node 192.168.32.55
>>
>>and so on, although the nodes got an IP assigned with RARP and
>>reboot at the very end of stage 2...
>>
>>looking at the source code I wonder if this comes
>>from a socket not accepted in "master". How do I increase
>>the number of available sockets ? What configuration file
>>could I have written badly ?
>
>
> It sounds like you need to add your new nodes in /etc/beowulf/config --
> or the config file you use for bpmaster.
>
> Nic

thanks again ; but yes, I had increased the number of nodes, and I did a kill -HUP <beoserv> without it complaining.

--
Florent Calvayrac | Tel : 02 43 83 26 26
Laboratoire de Physique de l'Etat Condense | Fax : 02 43 83 35 18
UMR-CNRS 6087 | http://www.univ-lemans.fr/~fcalvay
Universite du Maine-Faculte des Sciences | 72085 Le Mans Cedex 9 |