Messages in the archive, by month:

| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 2001 |     |     |     |     |     |     |     |     |     | 25  |     | 22  |
| 2002 | 13  | 22  | 39  | 10  | 26  | 23  | 38  | 20  | 27  | 76  | 32  | 11  |
| 2003 | 8   | 23  | 12  | 39  | 1   | 48  | 35  | 15  | 60  | 27  | 9   | 32  |
| 2004 | 8   | 16  | 40  | 25  | 12  | 33  | 49  | 39  | 26  | 47  | 26  | 36  |
| 2005 | 29  | 15  | 22  | 1   | 8   | 32  | 11  | 17  | 9   | 7   | 15  |     |
From: <er...@he...> - 2002-10-10 19:50:05
On Wed, Oct 09, 2002 at 07:38:10PM -0600, Wilton Wong wrote:
> Well after working a bit more it appears I may have a genuine bug in
> _bproc_vrfork_io(), I can seem to only fork off one process at a time, see the
> attached patch to beoboot../node_up/node_up.c and see what I mean.
>
> Anyways we have no ideas here.. and really have no clue as to what is supposed
> to happen or what is not happening ;) this patch was sort of a "we know what
> works" now let's just do that type patch.. not for use in real production
> environments. All I know is that with this patch I am able to boot more than 1
> node at the same time.

It works fine for me. I've seen node_up do 50+ nodes at a time on our cluster here.

The only hidden gotcha that I can think of with vrfork is that the nodes need to be able to reach one another w/ IP. So, if they're on different subnets, etc. you're going to have some trouble. The default route that the boot code puts in is bogus.

If it's working for only one process at once, you *should* be able to force it to always do that w/o hacking anything by sticking "nodeupmaxclients 1" in /etc/beowulf/config.

- Erik
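A minimal sketch of the workaround Erik describes, assuming only the directive name given in his message; the rest of /etc/beowulf/config is left exactly as it already is:

    # /etc/beowulf/config (excerpt)
    # Force node_up to serve one client per worker instead of batching nodes.
    nodeupmaxclients 1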
From: <er...@he...> - 2002-10-10 17:22:40
On Wed, Oct 09, 2002 at 07:47:30PM -0600, Wilton Wong wrote:
> We have been trying to integrate this for sometime without much success, there
> seems to be a deadlock in the kernel, somewhere someone is locking using the
> wrong lock or in the wrong context or something.. when we run more than one
> process per node, in this case "bpsh <node> yes".. eventually (within a matter
> of seconds the kernel is too busy to handle and requests such as responding to
> the bproc heartbeat)
>
> A forced kernel stack dump using lcrash reveals:
>
> ....
> dc61c000 0 1462 1418 0x01 0x00000000 0:0 yes
> dc64c000 0 1463 1415 0x00 0x00000040 402:127 yes
> dc5e6000 0 1464 1418 0x02 0x00000000 25:6 yes
> dc5c2000 0 1465 1415 0x00 0x00000040 402:127 yes
> >> trace dc5e6000
> ================================================================
> STACK TRACE FOR TASK: 0xdc5e6000(yes)
>
> 0 schedule+901 [0xc01197d5]
> 1 schedule_timeout+18 [0xc0126582]
> 2 [bproc]bproc_response_wait+115 [0xe08c3697]
> 3 [bproc]send_process+163 [0xe08c20e3]
> 4 [bproc]do_execmove+126 [0xe08c6eee]
> 5 [bproc]do_bproc+980 [0xe08c7744]
> 6 system_call+44 [0xc0108f94]
> ebx: 00000000 ecx: 00000000 edx: 00000000 esi: 00000000
> edi: 00000000 ebp: 00000000 eax: 00000000 ds: 002b
> es: 002b eip: 40000b50 cs: 0023 eflags: 00000216
> esp: bffffb50 ss: 002b
> ================================================================
>
> And of course if we remove the O(1) scheduler everything works fine.. any help
> in where to look for this problem would be appreciated.
If you're eventually falling down on a ping timeout, it sounds like
something is making a bad scheduling decision. Try commenting out
this snippet from the slave daemon and see if things get better.
p.sched_priority = 1;
if (sched_setscheduler(0, SCHED_FIFO, &p))
syslog(LOG_NOTICE, "Failed to set real-time scheduling for"
" slave daemon.\n");
That's the only even vaguely odd scheduling thing BProc does. For the
rest it's just very uninteresting wait queue and task status (running,
interruptible, etc.) stuff.
- Erik
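For context, a self-contained sketch of what that snippet does; the headers, the wrapper function, and the struct declaration are filled in here and are not part of the quoted slave-daemon source:

    #include <sched.h>
    #include <syslog.h>

    /* Sketch of the real-time scheduling setup quoted above.  Commenting out
     * the sched_setscheduler() call is the experiment Erik suggests. */
    static void slave_set_realtime(void)
    {
        struct sched_param p;

        p.sched_priority = 1;                       /* lowest SCHED_FIFO priority */
        if (sched_setscheduler(0, SCHED_FIFO, &p))  /* pid 0 = the calling process */
            syslog(LOG_NOTICE, "Failed to set real-time scheduling for"
                   " slave daemon.\n");
    }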
From: Wilton W. <ww...@ha...> - 2002-10-10 01:47:45
We have been trying to integrate this for some time without much success; there seems to be a deadlock in the kernel, somewhere someone is locking using the wrong lock or in the wrong context or something.. When we run more than one process per node, in this case "bpsh <node> yes", eventually (within a matter of seconds) the kernel is too busy to handle any requests, such as responding to the bproc heartbeat.

A forced kernel stack dump using lcrash reveals:

....
dc61c000  0  1462  1418  0x01  0x00000000    0:0    yes
dc64c000  0  1463  1415  0x00  0x00000040  402:127  yes
dc5e6000  0  1464  1418  0x02  0x00000000   25:6    yes
dc5c2000  0  1465  1415  0x00  0x00000040  402:127  yes
>> trace dc5e6000
================================================================
STACK TRACE FOR TASK: 0xdc5e6000(yes)

 0 schedule+901 [0xc01197d5]
 1 schedule_timeout+18 [0xc0126582]
 2 [bproc]bproc_response_wait+115 [0xe08c3697]
 3 [bproc]send_process+163 [0xe08c20e3]
 4 [bproc]do_execmove+126 [0xe08c6eee]
 5 [bproc]do_bproc+980 [0xe08c7744]
 6 system_call+44 [0xc0108f94]
   ebx: 00000000  ecx: 00000000  edx: 00000000  esi: 00000000
   edi: 00000000  ebp: 00000000  eax: 00000000  ds: 002b
   es: 002b  eip: 40000b50  cs: 0023  eflags: 00000216
   esp: bffffb50  ss: 002b
================================================================

And of course if we remove the O(1) scheduler everything works fine.. Any help in where to look for this problem would be appreciated.

Thanks
- Wilton

----[ Wilton William Wong ]---------------------------------------------
11060-166 Avenue Ph : 01-780-456-9771 High Performance UNIX
Edmonton, Alberta FAX: 01-780-456-9772 and Linux Solutions
T5X 1Y3, Canada URL: http://www.harddata.com
-------------------------------------------------------[ Hard Data Ltd. ]----
From: Wilton W. <ww...@ha...> - 2002-10-10 01:38:26
Well, after working a bit more it appears I may have a genuine bug in _bproc_vrfork_io(); I seem to be able to fork off only one process at a time. See the attached patch to beoboot../node_up/node_up.c and see what I mean.

Anyways, we have no ideas here.. and really have no clue as to what is supposed to happen or what is not happening ;) This patch was sort of a "we know what works, now let's just do that" type patch.. not for use in real production environments. All I know is that with this patch I am able to boot more than 1 node at the same time. Our local linux hacker had this to say when he "fixed" this: "I don't know what is going on.. this patch is one big hack" ;)

I am currently running:

beoboot-lanl.1.3
bproc-3.2.1
linux-2.4.19 + bproc patches

- Wilton

Original Message Follows:

> I am having a bit of difficulty booting more than one node at the same time,
> (booting works if I stagger the booting) something seems to hang up when I
> reach the point in boeoboot where it starts the node_up worker processes for
> more than 1 node..
>
> In /var/beowulf/node.2 .. node.3 .. etc... I see it hangs here:
> <SNIP>
> ...
> nodeup : Plugin vmadlib returned status 0 (ok)
> nodeup : No premove function for nodeinfo
> nodeup : Starting 2 child processes.
> </SNIP>
>
> In /var/log/messages
> <SNIP>
> Oct 8 17:43:52 srv001 beoserv: Starting node_up worker for 2 clients.
> </SNIP>
>
> The cluster boots fine if nodeup only starts 1 child process at a time.
>
> <SNIP>
> ...
> nodeup : Plugin vmadlib returned status 0 (ok)
> nodeup : No premove function for nodeinfo
> nodeup : Starting 1 child processes.
> nodeup : Running postmove functions
> nodeup : Calling postmove for kmod
> nodeup : Plugin kmod returned status 0 (ok)
> ...
> </SNIP>
>
> <SNIP>
> Oct 8 17:50:49 srv001 beoserv: Starting node_up worker for 1 clients.
> </SNIP>

----[ Wilton William Wong ]---------------------------------------------
11060-166 Avenue Ph : 01-780-456-9771 High Performance UNIX
Edmonton, Alberta FAX: 01-780-456-9772 and Linux Solutions
T5X 1Y3, Canada URL: http://www.harddata.com
-------------------------------------------------------[ Hard Data Ltd. ]----
From: Wilton W. <ww...@ha...> - 2002-10-09 02:40:38
Silly me.. error on my part.. disregard my silliness..

- Wilton

On Tue, 08 Oct 2002, Wilton Wong wrote:
> I am having a bit of difficulty booting more than one node at the same time,
> (booting works if I stagger the booting) something seems to hang up when I
> reach the point in boeoboot where it starts the node_up worker processes for
> more than 1 node..
>
> In /var/beowulf/node.2 .. node.3 .. etc... I see it hangs here:
> <SNIP>
> ...
> nodeup : Plugin vmadlib returned status 0 (ok)
> nodeup : No premove function for nodeinfo
> nodeup : Starting 2 child processes.
> </SNIP>
>
> In /var/log/messages
> <SNIP>
> Oct 8 17:43:52 srv001 beoserv: Starting node_up worker for 2 clients.
> </SNIP>
>
> The cluster boots fine if nodeup only starts 1 child process at a time.
>
> <SNIP>
> ...
> nodeup : Plugin vmadlib returned status 0 (ok)
> nodeup : No premove function for nodeinfo
> nodeup : Starting 1 child processes.
> nodeup : Running postmove functions
> nodeup : Calling postmove for kmod
> nodeup : Plugin kmod returned status 0 (ok)
> ...
> </SNIP>
>
> <SNIP>
> Oct 8 17:50:49 srv001 beoserv: Starting node_up worker for 1 clients.
> </SNIP>
>
> Any clues on where to look for the hang up ?
>
> - Wilton
>
> ----[ Wilton William Wong ]---------------------------------------------
> 11060-166 Avenue Ph : 01-780-456-9771 High Performance UNIX
> Edmonton, Alberta FAX: 01-780-456-9772 and Linux Solutions
> T5X 1Y3, Canada URL: http://www.harddata.com
> -------------------------------------------------------[ Hard Data Ltd. ]----
>
>
> -------------------------------------------------------
> This sf.net email is sponsored by:ThinkGeek
> Welcome to geek heaven.
> http://thinkgeek.com/sf
> _______________________________________________
> BProc-users mailing list
> BPr...@li...
> https://lists.sourceforge.net/lists/listinfo/bproc-users

----[ Wilton William Wong ]---------------------------------------------
11060-166 Avenue Ph : 01-780-456-9771 High Performance UNIX
Edmonton, Alberta FAX: 01-780-456-9772 and Linux Solutions
T5X 1Y3, Canada URL: http://www.harddata.com
-------------------------------------------------------[ Hard Data Ltd. ]----
From: Wilton W. <ww...@ha...> - 2002-10-08 23:51:30
I am having a bit of difficulty booting more than one node at the same time (booting works if I stagger the booting); something seems to hang up when I reach the point in beoboot where it starts the node_up worker processes for more than 1 node..

In /var/beowulf/node.2 .. node.3 .. etc... I see it hangs here:

<SNIP>
...
nodeup : Plugin vmadlib returned status 0 (ok)
nodeup : No premove function for nodeinfo
nodeup : Starting 2 child processes.
</SNIP>

In /var/log/messages

<SNIP>
Oct 8 17:43:52 srv001 beoserv: Starting node_up worker for 2 clients.
</SNIP>

The cluster boots fine if nodeup only starts 1 child process at a time.

<SNIP>
...
nodeup : Plugin vmadlib returned status 0 (ok)
nodeup : No premove function for nodeinfo
nodeup : Starting 1 child processes.
nodeup : Running postmove functions
nodeup : Calling postmove for kmod
nodeup : Plugin kmod returned status 0 (ok)
...
</SNIP>

<SNIP>
Oct 8 17:50:49 srv001 beoserv: Starting node_up worker for 1 clients.
</SNIP>

Any clues on where to look for the hang up ?

- Wilton

----[ Wilton William Wong ]---------------------------------------------
11060-166 Avenue Ph : 01-780-456-9771 High Performance UNIX
Edmonton, Alberta FAX: 01-780-456-9772 and Linux Solutions
T5X 1Y3, Canada URL: http://www.harddata.com
-------------------------------------------------------[ Hard Data Ltd. ]----
From: steven j. <py...@li...> - 2002-10-08 11:25:23
Greetings,

There are a good many sensors out there. I focus on the ones used in motherboards I deal with (thus, the three that I have). There is a natural performance advantage to reading one file over several for sensor data. It might be best to use the scan option as a compatibility mode where a specialized driver is not available.

The Tyan does take a while. That's a function of the way the chip itself samples data. I avoid the hit by having the driver understand that the granularity of sensor data is 5 seconds (actually it's more like 1, but 5 is 'good enough' and improves performance). I place that in the driver since it's specific to the chip. The SiS950, for example, takes negligible time to read.

In cases where the sensors are available on the LPC as well as i2c, I prefer a standalone driver so that it won't need to track lm_sensors at all. Since sensor information is fairly simple, a standalone driver matures quickly and need not be touched again until the next kernel branch comes out.

G'day,
sjames

On Mon, 7 Oct 2002, Wilton Wong wrote:
> It seems like alot of source to maintain especially since new chips come out
> everyday and lm_sensors is always under active development.. wouldn't it be
> better to have "mon" scan the /proc/sys/dev/sensors/chips file and readout the
> information in each chip ? or have a seperate daemon do this and plug the info
> into "mon" using monhole ?
>
> Also on the Tyan-2466 boards it takes a LONG time to read the sensor data (up
> to a second, maybe even 2 to read all of the sensors) I think that period will
> really adversly affect supermon performance, would "mon" block while sampling the
> temperature data ?
>
> - Wilton
>
> On Mon, 07 Oct 2002, steven james wrote:
>
> > I have drivers for sis950, Winbond w83781d, and adm1021 if they would
> > help.
>
> ----[ Wilton William Wong ]---------------------------------------------
> 11060-166 Avenue Ph : 01-780-456-9771 High Performance UNIX
> Edmonton, Alberta FAX: 01-780-456-9772 and Linux Solutions
> T5X 1Y3, Canada URL: http://www.harddata.com
> -------------------------------------------------------[ Hard Data Ltd. ]----

--
-------------------------steven james, director of research, linux labs
... ........ ..... .... 230 peachtree st nw ste 701
the original linux labs atlanta.ga.us 30303
-since 1995 http://www.linuxlabs.com
office 404.577.7747 fax 404.577.7743
-----------------------------------------------------------------------
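A rough sketch of the caching idea described above, namely treating a sensor sample as valid for a fixed granularity so a slow chip is not re-read on every request. The names and layout here are hypothetical and do not come from sjames's drivers:

    #include <time.h>

    #define SENSOR_GRANULARITY 5        /* seconds; the assumed "good enough" sample age */

    struct sensor_cache {
        time_t last_read;               /* when the chip was last actually read */
        int    temp;                    /* cached reading */
    };

    /* Placeholder for the slow hardware access (the expensive part on the Tyan). */
    static int read_chip(void) { return 42; }

    int sensor_get_temp(struct sensor_cache *c)
    {
        time_t now = time(NULL);

        /* Only touch the hardware if the cached sample is older than the granularity. */
        if (now - c->last_read >= SENSOR_GRANULARITY) {
            c->temp = read_chip();
            c->last_read = now;
        }
        return c->temp;
    }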
From: Wilton W. <ww...@ha...> - 2002-10-08 03:55:13
It seems like a lot of source to maintain, especially since new chips come out every day and lm_sensors is always under active development.. Wouldn't it be better to have "mon" scan the /proc/sys/dev/sensors/chips file and read out the information in each chip ? Or have a separate daemon do this and plug the info into "mon" using monhole ?

Also, on the Tyan-2466 boards it takes a LONG time to read the sensor data (up to a second, maybe even 2, to read all of the sensors). I think that period will really adversely affect supermon performance; would "mon" block while sampling the temperature data ?

- Wilton

On Mon, 07 Oct 2002, steven james wrote:

> I have drivers for sis950, Winbond w83781d, and adm1021 if they would
> help.

----[ Wilton William Wong ]---------------------------------------------
11060-166 Avenue Ph : 01-780-456-9771 High Performance UNIX
Edmonton, Alberta FAX: 01-780-456-9772 and Linux Solutions
T5X 1Y3, Canada URL: http://www.harddata.com
-------------------------------------------------------[ Hard Data Ltd. ]----
From: steven j. <py...@li...> - 2002-10-07 21:18:28
Greetings,

It's not really very hard. Most of the sensor drivers will keep an array of client structures and have an update function that expects a pointer to that struct. The struct itself has a pointer to a data struct with the actual sensor data.

I generally just add a line in the probe function to store the client structs in a different array just for supermon. The convert_value function calls the update function, then just formats the data for proc.

I have drivers for sis950, Winbond w83781d, and adm1021 if they would help.

G'day,
sjames

On Mon, 7 Oct 2002, Wilton Wong wrote:
>
> On Mon, 07 Oct 2002, steven james wrote:
>
> > Currently, I'm using method number 3. But I need to update to the latest
> > release and evaluate if that is still how I want to go.
>
> I see some of your code inside the newest supermon packages.. how hard is it to
> add new sensor chips ?
>
> - Wilton
>
> ----[ Wilton William Wong ]---------------------------------------------
> 11060-166 Avenue Ph : 01-780-456-9771 High Performance UNIX
> Edmonton, Alberta FAX: 01-780-456-9772 and Linux Solutions
> T5X 1Y3, Canada URL: http://www.harddata.com
> -------------------------------------------------------[ Hard Data Ltd. ]----

--
-------------------------steven james, director of research, linux labs
... ........ ..... .... 230 peachtree st nw ste 701
the original linux labs atlanta.ga.us 30303
-since 1995 http://www.linuxlabs.com
office 404.577.7747 fax 404.577.7743
-----------------------------------------------------------------------
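The pattern is easier to see in code. The sketch below is purely illustrative: every name in it (my_chip_client, my_chip_update, supermon_clients, the convert_value signature) is hypothetical and is not taken from the real supermon or lm_sensors sources:

    #include <stdio.h>

    /* Per-chip data as the driver keeps it (hypothetical layout). */
    struct my_chip_data   { int temp; int fan; };
    struct my_chip_client { struct my_chip_data *data; };

    #define MAX_CLIENTS 8
    static struct my_chip_client *supermon_clients[MAX_CLIENTS]; /* extra array kept for supermon */
    static int supermon_nclients;

    /* Driver's existing update routine: refreshes client->data from the hardware. */
    static void my_chip_update(struct my_chip_client *client) { (void)client; }

    /* The "one line in the probe function": remember each detected client. */
    static void my_chip_probe_hook(struct my_chip_client *client)
    {
        if (supermon_nclients < MAX_CLIENTS)
            supermon_clients[supermon_nclients++] = client;
    }

    /* convert_value-style helper: refresh the chip, then format the readings for proc. */
    static int convert_value(char *buf, int n)
    {
        struct my_chip_client *c = supermon_clients[n];
        my_chip_update(c);
        return sprintf(buf, "%d %d", c->data->temp, c->data->fan);
    }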
From: Wilton W. <ww...@ha...> - 2002-10-07 20:01:01
On Mon, 07 Oct 2002, steven james wrote:

> Currently, I'm using method number 3. But I need to update to the latest
> release and evaluate if that is still how I want to go.

I see some of your code inside the newest supermon packages.. how hard is it to add new sensor chips ?

- Wilton

----[ Wilton William Wong ]---------------------------------------------
11060-166 Avenue Ph : 01-780-456-9771 High Performance UNIX
Edmonton, Alberta FAX: 01-780-456-9772 and Linux Solutions
T5X 1Y3, Canada URL: http://www.harddata.com
-------------------------------------------------------[ Hard Data Ltd. ]----
From: steven j. <py...@li...> - 2002-10-07 19:31:46
Greetings,

Currently, I'm using method number 3. But I need to update to the latest release and evaluate if that is still how I want to go.

G'day,
sjames

On Mon, 7 Oct 2002, Wilton Wong wrote:

> We are just experimenting with supermon-1.3 and I was wondering what is the
> best way to add lm_sensors data into supermon, I figure there is 3 feasable
> ways:
>
> 1. Hack the lm_sensors modules to add data to /proc/sys/supermon (eww too much
> work)
> 2. Hack the lm_sensors sensord to work with mon using the monhole
> 3. Hack mon to add lm_sensors data read from /proc/dev/...
>
> Comments ?
>
> - Wilton
>
> ----[ Wilton William Wong ]---------------------------------------------
> 11060-166 Avenue Ph : 01-780-456-9771 High Performance UNIX
> Edmonton, Alberta FAX: 01-780-456-9772 and Linux Solutions
> T5X 1Y3, Canada URL: http://www.harddata.com
> -------------------------------------------------------[ Hard Data Ltd. ]----
>
>
> -------------------------------------------------------
> This sf.net email is sponsored by:ThinkGeek
> Welcome to geek heaven.
> http://thinkgeek.com/sf
> _______________________________________________
> BProc-users mailing list
> BPr...@li...
> https://lists.sourceforge.net/lists/listinfo/bproc-users
>

--
-------------------------steven james, director of research, linux labs
... ........ ..... .... 230 peachtree st nw ste 701
the original linux labs atlanta.ga.us 30303
-since 1995 http://www.linuxlabs.com
office 404.577.7747 fax 404.577.7743
-----------------------------------------------------------------------
From: Wilton W. <ww...@ha...> - 2002-10-07 19:17:00
We are just experimenting with supermon-1.3 and I was wondering what is the best way to add lm_sensors data into supermon. I figure there are 3 feasible ways:

1. Hack the lm_sensors modules to add data to /proc/sys/supermon (eww, too much work)
2. Hack the lm_sensors sensord to work with mon using the monhole
3. Hack mon to add lm_sensors data read from /proc/dev/...

Comments ?

- Wilton

----[ Wilton William Wong ]---------------------------------------------
11060-166 Avenue Ph : 01-780-456-9771 High Performance UNIX
Edmonton, Alberta FAX: 01-780-456-9772 and Linux Solutions
T5X 1Y3, Canada URL: http://www.harddata.com
-------------------------------------------------------[ Hard Data Ltd. ]----
From: Nicholas H. <he...@se...> - 2002-09-30 20:26:53
Thanks for the patch -- we will test that in the next few days as we get a chance to get it onto machines.

Thanks!
Nic

--
Nicholas Henke
Linux cluster system programmer
University of Pennsylvania
he...@se... - 215.573.8149
From: Erik A. H. <hen...@la...> - 2002-09-30 19:58:31
On Mon, Sep 30, 2002 at 11:32:11AM -0400, Thomas Clausen wrote:
> Dear Erik,
>
> I recently installed a new cluster with the newest cm. We need a scheduler
> system and I have been looking at bjs. I would like to work on it. Is there a
> newer version I should be working on?

Yup. I've rewritten most of it. It should be a lot cleaner and safer now - I hope :). I just put another tarball on sourceforge. You will also need "cmtools", a library of random utility stuff that it uses.

- Erik
From: <er...@he...> - 2002-09-27 17:42:08
On Fri, Sep 27, 2002 at 10:55:15AM -0400, Nicholas Henke wrote:
> After upgrading to 2.4.18, the machines are staying alive after the oops, so I
> have gotten a decent ksymoops of the problem. It looks to be a definate bproc
> issue --a bproc_hook_proc_ppid is where it traced the error to. If there is
> anything I can do to get more info on this oops, the machine is still up and
> working.
...
> >>EIP; f898cf27 <[bproc]bproc_hook_proc_ppid+57/68> <=====
> Trace; c015c965 <proc_pid_stat+275/2f0>
> Trace; c011d054 <do_exit+254/270>
> Trace; c015a593 <proc_info_read+63/120>
> Trace; c013b1f6 <sys_read+96/120>
> Trace; c010734b <system_call+33/38>

I think this back trace is at least partly bogus. do_exit doesn't call proc_pid_stat. In any case there does appear to be a locking goof in proc_pid_stat w/ the proc_ppid hook.

Unfortunately, the quickest way to fix this seems to be by modifying procfs just a bit. Try this patch for fs/proc/array.c in the kernel and let me know if it fixes your problem:

--- linux-2.4.18/fs/proc/array.c.orig   Fri Sep 27 11:08:37 2002
+++ linux-2.4.18/fs/proc/array.c        Fri Sep 27 11:12:01 2002
@@ -305,7 +305,7 @@
     sigset_t sigign, sigcatch;
     char state;
     int res;
-    pid_t ppid;
+    pid_t pid, ppid;
     struct mm_struct *mm;
@@ -343,15 +343,17 @@
     nice = task->nice;

     read_lock(&tasklist_lock);
-    ppid = task->pid ? task->p_opptr->pid : 0;
+    pid = bproc_hook_imv(task->pid, proc_pid, (task));
+    ppid = task->pid ?
+        bproc_hook_imv(task->p_opptr->pid, proc_ppid, (task)) : 0;
     read_unlock(&tasklist_lock);

     res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
%lu %lu %lu %lu %lu %ld %ld %ld %ld %ld %ld %lu %lu %ld %lu %lu %lu %lu %lu \
%lu %lu %lu %lu %lu %lu %lu %lu %d %d\n",
-        bproc_hook_imv(task->pid, proc_pid, (task)),
+        pid,
         task->comm,
         state,
-        bproc_hook_imv(ppid, proc_ppid, (task)),
+        ppid,
         task->pgrp,
         task->session,
         tty_nr,
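The shape of the fix, reduced to a sketch: values that are only stable while tasklist_lock is held (here the pids resolved through the bproc hooks) are copied into locals inside the locked region, and only the local snapshots are used afterwards. The names below are generic stand-ins, not the procfs code itself:

    #include <stdio.h>
    #include <pthread.h>

    /* Generic illustration of the copy-under-lock pattern used in the patch above. */
    static pthread_rwlock_t tasklist_lock = PTHREAD_RWLOCK_INITIALIZER;

    struct task { int pid; struct task *parent; };

    int format_stat(char *buf, struct task *task)
    {
        int pid, ppid;

        pthread_rwlock_rdlock(&tasklist_lock);      /* parent pointer is stable only here */
        pid  = task->pid;
        ppid = task->parent ? task->parent->pid : 0;
        pthread_rwlock_unlock(&tasklist_lock);

        /* Slow formatting happens outside the lock, using the snapshots. */
        return sprintf(buf, "%d %d\n", pid, ppid);
    }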
From: Nicholas H. <he...@se...> - 2002-09-27 14:55:25
After upgrading to 2.4.18, the machines are staying alive after the oops, so I have gotten a decent ksymoops of the problem. It looks to be a definite bproc issue -- bproc_hook_proc_ppid is where it traced the error to. If there is anything I can do to get more info on this oops, the machine is still up and working.

Nic

Sep 27 10:38:59 node40.io.liniac.upenn.edu kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000010
Sep 27 10:38:59 node40.io.liniac.upenn.edu kernel: f898cf27
Sep 27 10:38:59 node40.io.liniac.upenn.edu kernel: *pde = 00000000
Sep 27 10:38:59 node40.io.liniac.upenn.edu kernel: Oops: 0000
Sep 27 10:38:59 node40.io.liniac.upenn.edu kernel: CPU: 1
Sep 27 10:38:59 node40.io.liniac.upenn.edu kernel: EIP: 0010:[<f898cf27>] Not tainted
Using defaults from ksymoops -t elf32-i386
Sep 27 10:38:59 node40.io.liniac.upenn.edu kernel: EFLAGS: 00010202
Sep 27 10:38:59 node40.io.liniac.upenn.edu kernel: eax: 00000000 ebx: 00000000 ecx: 00000002 edx: f6eea000
Sep 27 10:38:59 node40.io.liniac.upenn.edu kernel: esi: f6eea000 edi: 00000000 ebp: ffffffff esp: f6ebfe8c
Sep 27 10:38:59 node40.io.liniac.upenn.edu kernel: ds: 0018 es: 0018 ss: 0018
Sep 27 10:38:59 node40.io.liniac.upenn.edu kernel: Process ps (pid: 15061, stackpage=f6ebf000)
Sep 27 10:38:59 node40.io.liniac.upenn.edu kernel: Stack: c015c965 f6eea000 000002ca 000002ca 00000000 ffffffff 00000004 0000003e
Sep 27 10:38:59 node40.io.liniac.upenn.edu kernel:        0000027a 000000d7 00000217 00000000 00000000 00000005 00000001 00000000
Sep 27 10:38:59 node40.io.liniac.upenn.edu kernel:        00000000 00000000 00000000 0003eabf 00000000 00000000 ffffffff 00000000
Sep 27 10:38:59 node40.io.liniac.upenn.edu kernel: Call Trace: [proc_pid_stat+629/752] [do_exit+596/624] [proc_info_read+99/288] [sys_read+150/288] [system_call+51/56]
Sep 27 10:38:59 node40.io.liniac.upenn.edu kernel: Call Trace: [<c015c965>] [<c011d054>] [<c015a593>] [<c013b1f6>] [<c010734b>]
Sep 27 10:38:59 node40.io.liniac.upenn.edu kernel: Code: 8b 40 10 c3 90 8b 82 94 00 00 00 8b 40 7c c3 89 f6 f6 05 00
Error (Oops_bfd_perror): scan_arch for specified architecture Success

>>EIP; f898cf27 <[bproc]bproc_hook_proc_ppid+57/68> <=====
Trace; c015c965 <proc_pid_stat+275/2f0>
Trace; c011d054 <do_exit+254/270>
Trace; c015a593 <proc_info_read+63/120>
Trace; c013b1f6 <sys_read+96/120>
Trace; c010734b <system_call+33/38>
Code; f898cf27 <[bproc]bproc_hook_proc_ppid+57/68>
00000000 <_EIP>:
Code; f898cf27 <[bproc]bproc_hook_proc_ppid+57/68> <=====
   0: 8b 40 10                mov 0x10(%eax),%eax <=====
Code; f898cf2a <[bproc]bproc_hook_proc_ppid+5a/68>
   3: c3                      ret
Code; f898cf2b <[bproc]bproc_hook_proc_ppid+5b/68>
   4: 90                      nop
Code; f898cf2c <[bproc]bproc_hook_proc_ppid+5c/68>
   5: 8b 82 94 00 00 00       mov 0x94(%edx),%eax
Code; f898cf32 <[bproc]bproc_hook_proc_ppid+62/68>
   b: 8b 40 7c                mov 0x7c(%eax),%eax
Code; f898cf35 <[bproc]bproc_hook_proc_ppid+65/68>
   e: c3                      ret
Code; f898cf36 <[bproc]bproc_hook_proc_ppid+66/68>
   f: 89 f6                   mov %esi,%esi
Code; f898cf38 <[bproc]bproc_hook_proc1+0/80>
  11: f6 05 00 00 00 00 00    testb $0x0,0x0

--
Nicholas Henke
Linux cluster system programmer
University of Pennsylvania
From: Wilton W. <ww...@ha...> - 2002-09-26 07:10:42
On Wed, 25 Sep 2002, Stanley, Matthew D. wrote:
> I have been trying to make gigabit networking work in a 4 node cluster now
> for several weeks. We have been trying both the Scyld 27Z-8 release and the
> Clustermatic March 02 release with RH 7.2. Both situations produce similar
> results. Using Dlink DGE-500T and Intel PRO/1000 gigabit ethernet cards we
> have problems booting the network boot.img from the master server. I have

I have had nothing but trouble with the DGE-500T's. We tracked them down to a problem with the Tyan 2466 motherboards we are using and to a BIOS-related PCI interrupt issue, and are working with Tyan to resolve this. It's not really high priority, since we don't have the same problem with the Intel e1000 cards and the cost difference now is negligible.

I will see if I can scrounge up a couple of e1000 cards sometime this week and see if I can duplicate your problem in our setup here. I only have a few questions:

Which e1000 drivers are you using ?
What is your motherboard ?
What switches are you using ?

> I realize this means more administration work, but can I just install all
> four machines identical and then use bpslave instead of the beoboot system?
> If I do this, what modifications are required to the scripts (ie. node_up) to
> provide similar functionality. These clusters are just used for NAMD
> research.

In this case I believe all node_up really has to do is exit with a "0" error number, ie:

<SNIP>
#!/bin/sh
exit 0
</SNIP>

I don't think any other modification is necessary.. tho' the bpslave file caching code for library loading on demand may be a bit screwy.. I haven't tried this configuration.

> I read somewhere else on the list where Erik suggested using the mcastbcast
> ethX parameter to fix these issues but it didn't seem to solve mine unless I

The mcastbcast option makes normally multicast packets (file transfers in bproc) into broadcast packets; this is to accommodate switches/hardware that doesn't handle multicast packets quickly.

I have been pretty swamped this week, doing other stuff for Hard Data Ltd., but soon I will have our patched and updated version of BProc ready for download from our website or ftp server. I'll announce it on the list when I put it up.

- Wilton

----[ Wilton William Wong ]---------------------------------------------
11060-166 Avenue Ph : 01-780-456-9771 High Performance UNIX
Edmonton, Alberta FAX: 01-780-456-9772 and Linux Solutions
T5X 1Y3, Canada URL: http://www.harddata.com
-------------------------------------------------------[ Hard Data Ltd. ]----
From: Stanley, M. D. <bc...@mi...> - 2002-09-25 20:55:01
I have been trying to make gigabit networking work in a 4 node cluster now for several weeks. We have been trying both the Scyld 27Z-8 release and the Clustermatic March 02 release with RH 7.2. Both situations produce similar results. Using Dlink DGE-500T and Intel PRO/1000 gigabit ethernet cards we have problems booting the network boot.img from the master server. I have tried using the newest drivers for each card, and even recompiling the kernel after updating the drivers to ensure no dependency issues. After the boot disk finds the network cards with the updated driver, it will grab an IP from the master server and attempt to load the boot image. It typically will ask for the boot.img and then not go any further, and then issue eth transmit timed out messages. The drivers work great on the eth0 internet side, works like a champ, no problems whatsoever; they just don't appear to work on the cluster side. I have tried using switches and a 100mbit hub and neither work. Just for reference, I have had these same four machines working with via-rhine and 3c59x network drivers with no problems under both the Scyld and Clustermatic releases; it just appears that neither want to work with the gigabit cards.

I realize this means more administration work, but can I just install all four machines identical and then use bpslave instead of the beoboot system? If I do this, what modifications are required to the scripts (ie. node_up) to provide similar functionality? These clusters are just used for NAMD research.

I read somewhere else on the list where Erik suggested using the mcastbcast ethX parameter to fix these issues, but it didn't seem to solve mine unless I didn't use it correctly; all I did was add a line to the config file with mcastbcast eth1, save the file, and restart the daemons.

Any help would be greatly appreciated!

Matt Stanley
Systems Administrator
Structural Biology Core
University of Missouri - Columbia
From: Nicholas H. <he...@se...> - 2002-09-21 16:50:03
nope -- all adaptec:
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.8
<Adaptec aic7899 Ultra160 SCSI adapter>
aic7899: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
scsi1 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.8
<Adaptec aic7899 Ultra160 SCSI adapter>
aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs
On Sat, 21 Sep 2002, Wilton Wong wrote:
> Are you running a symbois/LSI based SCSI controller ? just wondering, becaise
> we had a whole bunch of "running short of DMA buffer" errors untill we switched
> to the newer sym83cxx driver.. we woulget a whole bunch of these errors after a
> random amount of disk activity and then boom we would oops..
>
> - Wilton
>
From: Wilton W. <ww...@ha...> - 2002-09-21 16:42:27
Are you running a Symbios/LSI based SCSI controller ? Just wondering, because we had a whole bunch of "running short of DMA buffer" errors until we switched to the newer sym83cxx driver.. we would get a whole bunch of these errors after a random amount of disk activity and then boom, we would oops..

- Wilton

On Sat, 21 Sep 2002, Nicholas Henke wrote:
> I am running both a 2.4.17 and 2.4.19 vanilla patched with bproc. I have 2
> oops traces from klogd, that appear to be related. I am not sure if this
> is a bproc or kernel issue. The kernels have been running on Dell 1550
> PIII Dual machines with 2GB ram and 2GB swap and IBM x330s with the same
> setups. I have seen about 20 machines toss this same oops in the last few
> days, all when under heavy load and memory pressure. After oopsing, the
> machine will remain resonsive and can run processes, baring ps, top,
> shutdown and reboot -- I am sure there are more that won't run :) I would
> greatly appreciate any help -- I have almost 200 machines running these
> kernels. The oopsing has not just started recently -- I have seen it all
> along, but have never been able to get the decoded info from klogd, and it
> is just now becoming a problem with so many machines oopsing.
>
> Attached is one file with 2 oops reports from klogd.
> Nic
>
> --
> Nicholas Henke
> Linux cluster system programmer
> University of Pennsylvania
> he...@se... - 215.573.8149

Content-Description: oops traces
> -------- 2.4.17 --------------
>
> Sep 19 18:49:10 node25 kernel: kernel BUG at page_alloc.c:84!
> Sep 19 18:49:10 node25 kernel: invalid operand: 0000
> Sep 19 18:49:10 node25 kernel: CPU: 1
> Sep 19 18:49:10 node25 kernel: EIP: 0010:[__free_pages_ok+169/832] Not tainted
> Sep 19 18:49:10 node25 kernel: EIP: 0010:[<c013c0d9>] Not tainted
> Sep 19 18:49:10 node25 kernel: EFLAGS: 00010286
> Sep 19 18:49:10 node25 kernel: eax: 0000001f ebx: c1b3ffa0 ecx: c0288424 edx: 000036aa
> Sep 19 18:49:10 node25 kernel: esi: c1b3ffa0 edi: 00000000 ebp: 00000000 esp: c5d8dee0
> Sep 19 18:49:10 node25 kernel: ds: 0018 es: 0018 ss: 0018
> Sep 19 18:49:10 node25 kernel: Process ps (pid: 1326, stackpage=c5d8d000)
> Sep 19 18:49:10 node25 kernel: Stack: c025c74d 00000054 ea5a4a41 e947f35c e947f000 ea5a4000 bfff0018 0000090f
> Sep 19 18:49:10 node25 kernel:        c1b3ffa0 e947f90f e947f90f c0125607 c5d8c000 e947f000 f21a270c c1b3ffa0
> Sep 19 18:49:10 node25 kernel:        e0e9b980 f21a270c e9092000 e947f000 e947f000 c01691fa e9092000 bffff6e5
> Sep 19 18:49:10 node25 kernel: Call Trace: [access_process_vm+439/560] [proc_pid_environ+186/208] [proc_info_read+99/288] [filp_open+77/96] [sys_read+150/208]
> Sep 19 18:49:10 node25 kernel: Call Trace: [<c0125607>] [<c01691fa>] [<c0169653>] [<c01442ed>] [<c0144e76>]
> Sep 19 18:49:10 node25 kernel: [sys_open+203/304] [system_call+51/56]
> Sep 19 18:49:10 node25 kernel: [<c01446db>] [<c01078ab>]
> Sep 19 18:49:10 node25 kernel:
> Sep 19 18:49:10 node25 kernel: Code: 0f 0b 5e 5f 8b 43 18 a9 80 00 00 00 74 10 6a 56 68 4d c7 25
> Sep 20 04:02:14 node25 syslogd 1.4.1: restart.
> Sep 20 14:00:00 node25 sshd(pam_unix)[1655]: session opened for user bindu by (uid=0)
> Sep 20 14:27:10 node25 sshd(pam_unix)[1655]: session closed for user bindu
> Sep 20 15:22:51 node25 sshd(pam_unix)[1683]: session opened for user root by (uid=0)
> Sep 20 15:23:37 node25 sshd(pam_unix)[1731]: session opened for user henken by (uid=0)
> Sep 20 15:23:37 node25 sshd(pam_unix)[1731]: session closed for user henken
> Sep 20 15:23:44 node25 sshd(pam_unix)[1739]: session opened for user henken by (uid=0)
> Sep 20 15:23:44 node25 sshd(pam_unix)[1739]: session closed for user henken
>
> ----------- 2.4.19 ------
>
> Sep 21 10:15:34 node25.io.liniac.upenn.edu kernel: Warning - running *really* short on DMA buffers
> Sep 21 10:39:10 node25.io.liniac.upenn.edu last message repeated 156 times
> Sep 21 10:39:10 node25.io.liniac.upenn.edu kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000010
> Sep 21 10:39:10 node25.io.liniac.upenn.edu kernel: printing eip:
> Sep 21 10:39:10 node25.io.liniac.upenn.edu kernel: f89cee23
> Sep 21 10:39:10 node25.io.liniac.upenn.edu kernel: *pde = 00000000
> Sep 21 10:39:10 node25.io.liniac.upenn.edu kernel: Oops: 0000
> Sep 21 10:39:10 node25.io.liniac.upenn.edu kernel: CPU: 1
> Sep 21 10:39:10 node25.io.liniac.upenn.edu kernel: EIP: 0010:[<f89cee23>] Not tainted
> Sep 21 10:39:10 node25.io.liniac.upenn.edu kernel: EFLAGS: 00010202
> Sep 21 10:39:10 node25.io.liniac.upenn.edu kernel: eax: 00000000 ebx: 00000000 ecx: 00000002 edx: f70c0000
> Sep 21 10:39:10 node25.io.liniac.upenn.edu kernel: esi: f70c0000 edi: 00000000 ebp: ffffffff esp: daabbe8c
> Sep 21 10:39:10 node25.io.liniac.upenn.edu kernel: ds: 0018 es: 0018 ss: 0018
> Sep 21 10:39:10 node25.io.liniac.upenn.edu kernel: Process ps (pid: 2266, stackpage=daabb000)
> Sep 21 10:39:10 node25.io.liniac.upenn.edu kernel: Stack: c015f525 f70c0000 000002cc 000002cc 00000000 ffffffff 00000004 0000002c
> Sep 21 10:39:10 node25.io.liniac.upenn.edu kernel:        00000000 00000088 00000000 00000002 00000000 00000000 00000000 00000013
> Sep 21 10:39:10 node25.io.liniac.upenn.edu kernel:        00000000 00000000 00000000 004a190c 00000000 00000000 ffffffff 00000000
> Sep 21 10:39:10 node25.io.liniac.upenn.edu kernel: Call Trace: [proc_pid_stat+629/752] [do_exit+719/736] [proc_info_read+99/288] [sys_read+150/272] [sys_open+87/160]
> Sep 21 10:39:10 node25.io.liniac.upenn.edu kernel: Call Trace: [<c015f525>] [<c011f86f>] [<c015d0a3>] [<c013de76>] [<c013d827>]
> Sep 21 10:39:10 node25.io.liniac.upenn.edu kernel: [system_call+51/56]
> Sep 21 10:39:10 node25.io.liniac.upenn.edu kernel: [<c0108cfb>]
> Sep 21 10:39:10 node25.io.liniac.upenn.edu kernel:
> Sep 21 10:39:10 node25.io.liniac.upenn.edu kernel: Code: 8b 40 10 c3 90 8b 82 94 00 00 00 8b 40 7c c3 89 f6 f6 05 60

----[ Wilton William Wong ]---------------------------------------------
11060-166 Avenue Ph : 01-780-456-9771 High Performance UNIX
Edmonton, Alberta FAX: 01-780-456-9772 and Linux Solutions
T5X 1Y3, Canada URL: http://www.harddata.com
-------------------------------------------------------[ Hard Data Ltd. ]----
From: Nicholas H. <he...@se...> - 2002-09-21 16:20:10
I am running both a 2.4.17 and a 2.4.19 vanilla kernel patched with bproc. I have 2 oops traces from klogd that appear to be related. I am not sure if this is a bproc or kernel issue. The kernels have been running on Dell 1550 PIII Dual machines with 2GB ram and 2GB swap, and IBM x330s with the same setups. I have seen about 20 machines toss this same oops in the last few days, all when under heavy load and memory pressure. After oopsing, the machine will remain responsive and can run processes, barring ps, top, shutdown and reboot -- I am sure there are more that won't run :) I would greatly appreciate any help -- I have almost 200 machines running these kernels. The oopsing has not just started recently -- I have seen it all along, but have never been able to get the decoded info from klogd, and it is just now becoming a problem with so many machines oopsing.

Attached is one file with 2 oops reports from klogd.

Nic

--
Nicholas Henke
Linux cluster system programmer
University of Pennsylvania
he...@se... - 215.573.8149
From: Jack N. <jj...@pa...> - 2002-09-18 15:11:19
Found my problem.. typo in Makefile.conf. Node boots nicely now. But modutils are still hosed... oh well...

Thanks!
Jack Neely

--
Jack Neely <sl...@qu...>
Linux Realm Kit Administration and Development
PAMS Computer Operations at NC State University
GPG Fingerprint: 1917 5AC1 E828 9337 7AA4 EA6B 213B 765F 3B6A 5B89
From: Jack N. <jj...@pa...> - 2002-09-18 13:46:47
Replying to my own message...

I've been able to forward port modutils. The RPMs and SRPM are on rk-devel with the other stuff. Does it work? Heck if I know. I've installed the new package on my head node, rebuilt beoboot and reinstalled it, but I still get the same error on my node: "Error getting file cache FD: Invalid argument" as produced by "bpslave -d -i -c /rootfs -p 10001 192.186.1.10".

If I haven't mentioned before, this is BProc 3.2.0 and beoboot lanl.1.2.

Thanks for your help!
Jack Neely

On Tue, Sep 17, 2002 at 04:43:09PM -0400, Jack Neely wrote:
> Heh...no...I'm using stock 2.4.19 compiled in a rather Red Hat-ish
> fashion. (This will end up as part of a kit that overlays Red Hat Linux
> 7.3.)
>
> The RPMs I'm building can be found at
>
> ftp://rk-devel.pams.ncsu.edu/ncsubeo/7.3
>
> They are designed to overlay our kit that is at
>
> http://www.linux.ncsu.edu/realmkit
>
> Thanks!
> Jack Neely
>
> On Tue, Sep 17, 2002 at 02:32:26PM -0600, Wilton Wong wrote:
> >
> > Oh if you patched the "Red Hat" kernel with the Bproc patch and tried to make
> > it work.. you will run into scheduler problems there is issues with bproc vs.
> > the O(1) scheduler inside the Red Hat kernel.
> >
> > - Wilton
> >
> > ----[ Wilton William Wong ]---------------------------------------------
> > 11060-166 Avenue Ph : 01-780-456-9771 High Performance UNIX
> > Edmonton, Alberta FAX: 01-780-456-9772 and Linux Solutions
> > T5X 1Y3, Canada URL: http://www.harddata.com
> > -------------------------------------------------------[ Hard Data Ltd. ]----
> >
> >
> > -------------------------------------------------------
> > This SF.NET email is sponsored by: AMD - Your access to the experts
> > on Hammer Technology! Open Source & Linux Developers, register now
> > for the AMD Developer Symposium. Code: EX8664
> > http://www.developwithamd.com/developerlab
> > _______________________________________________
> > BProc-users mailing list
> > BPr...@li...
> > https://lists.sourceforge.net/lists/listinfo/bproc-users
>
> --
> Jack Neely <sl...@qu...>
> Linux Realm Kit Administration and Development
> PAMS Computer Operations at NC State University
> GPG Fingerprint: 1917 5AC1 E828 9337 7AA4 EA6B 213B 765F 3B6A 5B89
>
> -------------------------------------------------------
> This SF.NET email is sponsored by: AMD - Your access to the experts
> on Hammer Technology! Open Source & Linux Developers, register now
> for the AMD Developer Symposium. Code: EX8664
> http://www.developwithamd.com/developerlab
> _______________________________________________
> BProc-users mailing list
> BPr...@li...
> https://lists.sourceforge.net/lists/listinfo/bproc-users

--
Jack Neely <sl...@qu...>
Linux Realm Kit Administration and Development
PAMS Computer Operations at NC State University
GPG Fingerprint: 1917 5AC1 E828 9337 7AA4 EA6B 213B 765F 3B6A 5B89
From: Jack N. <jj...@pa...> - 2002-09-17 20:43:16
Heh...no...I'm using stock 2.4.19 compiled in a rather Red Hat-ish
fashion. (This will end up as part of a kit that overlays Red Hat Linux
7.3.)
The RPMs I'm building can be found at
ftp://rk-devel.pams.ncsu.edu/ncsubeo/7.3
They are designed to overlay our kit that is at
http://www.linux.ncsu.edu/realmkit
Thanks!
Jack Neely
On Tue, Sep 17, 2002 at 02:32:26PM -0600, Wilton Wong wrote:
>
> Oh if you patched the "Red Hat" kernel with the Bproc patch and tried to make
> it work.. you will run into scheduler problems there is issues with bproc vs.
> the O(1) scheduler inside the Red Hat kernel.
>
> - Wilton
>
> ----[ Wilton William Wong ]---------------------------------------------
> 11060-166 Avenue Ph : 01-780-456-9771 High Performance UNIX
> Edmonton, Alberta FAX: 01-780-456-9772 and Linux Solutions
> T5X 1Y3, Canada URL: http://www.harddata.com
> -------------------------------------------------------[ Hard Data Ltd. ]----
>
>
>
> -------------------------------------------------------
> This SF.NET email is sponsored by: AMD - Your access to the experts
> on Hammer Technology! Open Source & Linux Developers, register now
> for the AMD Developer Symposium. Code: EX8664
> http://www.developwithamd.com/developerlab
> _______________________________________________
> BProc-users mailing list
> BPr...@li...
> https://lists.sourceforge.net/lists/listinfo/bproc-users
--
Jack Neely <sl...@qu...>
Linux Realm Kit Administration and Development
PAMS Computer Operations at NC State University
GPG Fingerprint: 1917 5AC1 E828 9337 7AA4 EA6B 213B 765F 3B6A 5B89
From: Wilton W. <ww...@ha...> - 2002-09-17 20:32:42
Oh, if you patched the "Red Hat" kernel with the Bproc patch and tried to make it work.. you will run into scheduler problems; there are issues with bproc vs. the O(1) scheduler inside the Red Hat kernel.

- Wilton

----[ Wilton William Wong ]---------------------------------------------
11060-166 Avenue Ph : 01-780-456-9771 High Performance UNIX
Edmonton, Alberta FAX: 01-780-456-9772 and Linux Solutions
T5X 1Y3, Canada URL: http://www.harddata.com
-------------------------------------------------------[ Hard Data Ltd. ]----