From: Vera, Michael <mxvera@qw...> - 2004-01-26 23:22:58

Greetings, I have installed a two-node cluster with a single non-shared root. I am trying to get a process to actually run on the secondary node, but I can't seem to get one to run there. I have set up new users to have the bash-ll shell by default, and I have run the command: `loadlevel -a on`

I looked in /proc/cluster/loadlog and this is what it contains:

[root@... cluster]# cat loadlog
rexec :pid 67777(init) -> node 2 mem 781376 my load 9 node2 load 21473466
rexec :pid 67780(init) -> node 2 mem 780672 my load 9 node2 load 19612320
rexec :pid 67788(onnode) -> node 2 mem 776368 my load 9 node2 load 18
rexec pid 67788(onnode) rexec failed to node 2 error 2
rexec :pid 67788(onnode) -> node 2 mem 776368 my load 9 node2 load 18
rexec pid 67788(onnode) rexec failed to node 2 error 2
rexec :pid 67788(onnode) -> node 2 mem 776368 my load 9 node2 load 18
rexec pid 67788(onnode) rexec failed to node 2 error 2
rexec :pid 67788(onnode) -> node 2 mem 776368 my load 9 node2 load 18
rexec pid 67793(onnode) rexec failed to node 1 error 2
rexec pid 67793(onnode) rexec failed to node 1 error 2
rexec pid 67793(onnode) rexec failed to node 1 error 2
rexec :pid 67801(onnode) -> node 2 mem 774916 my load 9 node2 load 18
rexec pid 67801(onnode) rexec failed to node 2 error 2
rexec :pid 67801(onnode) -> node 2 mem 774916 my load 9 node2 load 18
rexec pid 67801(onnode) rexec failed to node 2 error 2
rexec :pid 67801(onnode) -> node 2 mem 774916 my load 9 node2 load 18
rexec pid 67801(onnode) rexec failed to node 2 error 2
rexec :pid 67801(onnode) -> node 2 mem 774916 my load 9 node2 load 18
rexec :pid 67840(onnode) -> node 2 mem 771104 my load 9 node2 load 18
rexec pid 67840(onnode) rexec failed to node 2 error 2
rexec :pid 67840(onnode) -> node 2 mem 771104 my load 9 node2 load 18
rexec pid 67840(onnode) rexec failed to node 2 error 2
rexec :pid 67840(onnode) -> node 2 mem 771104 my load 9 node2 load 18
rexec pid 67840(onnode) rexec failed to node 2 error 2
rexec :pid 67840(onnode) -> node 2 mem 771104 my load 9 node2 load 18
rexec pid 67841(onnode) rexec failed to node 1 error 2
rexec pid 67841(onnode) rexec failed to node 1 error 2
rexec pid 67841(onnode) rexec failed to node 1 error 2
rexec :pid 68022(init) -> node 2 mem 738976 my load 9 node2 load 21473466
rexec :pid 68023(init) -> node 2 mem 738836 my load 9 node2 load 5334948
rexec :pid 68061(init) -> node 2 mem 736848 my load 9 node2 load 21473466
rexec :pid 68062(init) -> node 2 mem 736680 my load 9 node2 load 5365261
rexec :pid 68063(init) -> node 2 mem 736456 my load 9 node2 load 18
rexec :pid 68067(onnode) -> node 2 mem 736420 my load 9 node2 load 18
rexec pid 68067(onnode) rexec failed to node 2 error 2
rexec :pid 68067(onnode) -> node 2 mem 736420 my load 9 node2 load 18
rexec pid 68067(onnode) rexec failed to node 2 error 2
rexec :pid 68067(onnode) -> node 2 mem 736420 my load 9 node2 load 18
rexec pid 68067(onnode) rexec failed to node 2 error 2
rexec :pid 68067(onnode) -> node 2 mem 736420 my load 9 node2 load 18
rexec :pid 68876(onnode) -> node 2 mem 733640 my load 14 node2 load 1
rexec pid 68876(onnode) rexec failed to node 2 error 66
rexec :pid 68909(init) -> node 2 mem 733740 my load 9 node2 load 21473466
rexec :pid 68910(init) -> node 2 mem 733740 my load 9 node2 load 5275340
rexec :pid 68976(onnode) -> node 2 mem 732808 my load 9 node2 load 18
rexec pid 68976(onnode) rexec failed to node 2 error 2
rexec :pid 68976(onnode) -> node 2 mem 732808 my load 9 node2 load 18
rexec pid 68976(onnode) rexec failed to node 2 error 2
rexec :pid 68976(onnode) -> node 2 mem 732808 my load 9 node2 load 18
rexec pid 68976(onnode) rexec failed to node 2 error 2
rexec :pid 68976(onnode) -> node 2 mem 732808 my load 9 node2 load 18

I checked the rexec man page and tried this by hand:

[root@... cluster]# rexec -d bilbo2 date
rexec: Host = bilbo2
rexec: Command to execute = date
Password:
bilbo2: Connection refused
rexec: Error in rexec system call,
rexec: (The following system error may itself be in error)
rexec: Connection refused

How do I allow rexec to execute on the secondary node? Is this an xinetd.conf issue? It wasn't addressed in the installation manual.

Thanks,
Michael Vera
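[Archive note: on a Red Hat 9-era node, rexec is run out of xinetd, so the "Connection refused" above usually means the service is disabled on the secondary node or the rsh-server package (which supplies in.rexecd) is not installed there. The following is only a sketch of what /etc/xinetd.d/rexec would need to contain; the paths and package name are assumptions about a stock Red Hat 9 install, not taken from the thread.]

```
# /etc/xinetd.d/rexec -- sketch for a stock Red Hat 9 node.
# Assumes the rsh-server package provides /usr/sbin/in.rexecd.
service exec
{
        disable         = no
        socket_type     = stream
        wait            = no
        user            = root
        log_on_success  += USERID
        log_on_failure  += USERID
        server          = /usr/sbin/in.rexecd
}
```

After editing the file, reloading xinetd (for example with `service xinetd reload`) picks up the change without a reboot; the service must be enabled on every node that should accept rexec connections.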
From: Verdun, Jean-marie <jeanmarie.verdun@hp...> - 2004-01-26 23:21:12

Hi, We have an issue setting up a 2-node cluster based on DL360 G3. In fact we do not have shared root through SCSI or FC available between the 2 nodes. Is there any way to avoid the shared root and have a client/server approach? If yes, I believe a HOWTO could be useful. If this is a stupid question, excuse me for it ... Have a good day, Jm

ps: Here is the mail I received from my test engineer:

> Hi,
>
> I installed OpenSSI 1.0 on a DL360 G3 with a Red Hat 9.0 distribution.
> I had to install RH 8 first and then RH 9, because of a cdrom speed
> error which prevented me from booting from the RH 9 CD.
>
> Then I installed OpenSSI and added a node.
> The first node starts the boot process, then I have a panic:
>
> ...
> Ipcname_read completed
> Instruction(i) breakpoint #0 at 0x0124c70 (adjusted)
> 0xc0124c70 panic int3
> Entering kdb (current=0xf709a000, pid 131151) on process 0 due to breakpoint @ 0xc0124c70
> Kdb>
>
> Any idea?
>
> Thanks for your help
>
> Didier
From: John Byrne <john.byrne@hp...> - 2004-01-26 22:12:49

The mailing lists have been down. SourceForge says they are fixed. John
From: Vera, Michael <mxvera@qw...> - 2004-01-26 18:12:04

That would be ideal. Software RAID over a network might be slow, but talk about High Availability. :) I will build these boxes using the secondary node for /home and eagerly await a DRBD implementation with SSI.

Thank you,
Michael Vera

-----Original Message-----
From: Walker, Bruce J [mailto:bruce.walker@...]
Sent: Monday, January 26, 2004 11:16 AM
To: Vera, Michael; ssic-linux-users@...
Subject: RE: [SSI-users] HP DL360 Cluster - Shared Root

Michael, Sorry I missed your real question. Currently the shared root must be physically shared. One of the "ongoing" projects is to use DRBD as a way to software-replicate filesystems between non-shared disks on different nodes. The DRBD team was interested in helping when they got the release they were working on out. By now that should be done and we should re-engage them.

Bruce

> -----Original Message-----
> From: ssic-linux-users-admin@...
> [mailto:ssic-linux-users-admin@...] On Behalf Of Vera, Michael
> Sent: Monday, January 26, 2004 8:20 AM
> To: ssic-linux-users@...
> Subject: RE: [SSI-users] HP DL360 Cluster - Shared Root
>
> I guess my question is, can the drives on the secondary node become the
> root "failover" in case the primary node crashes? Or do the nodes *have*
> to use an external disk array to share root?
>
> Thank you,
> Michael Vera
>
> -----Original Message-----
> From: Walker, Bruce J [mailto:bruce.walker@...]
> Sent: Sunday, January 25, 2004 11:17 AM
> To: Vera, Michael; ssic-linux-users@...
> Subject: RE: [SSI-users] HP DL360 Cluster - Shared Root
>
> You needn't "waste" the space on your secondary nodes. Any node can
> "contribute" filesystems to the cluster and be transparently and
> coherently shared by all users/processes on all nodes. Obviously if a
> given node is lost, data stored on that node is unavailable.
>
> Bruce
>
> > -----Original Message-----
> > From: ssic-linux-users-admin@...
> > [mailto:ssic-linux-users-admin@...] On Behalf Of Vera, Michael
> > Sent: Friday, January 23, 2004 11:10 AM
> > To: ssic-linux-users@...
> > Subject: [SSI-users] HP DL360 Cluster - Shared Root
> >
> > Has anyone set up SSI on an HP DL360 with shared root? The only
> > shortcoming I see to this clustering strategy is wasting my 72GB RAID
> > 1 setup on the secondary nodes.
> >
> > Please let me know.
> >
> > Thank you,
> > Michael Vera
> > mxvera@...
> >
> > -------------------------------------------------------
> > The SF.Net email is sponsored by EclipseCon 2004,
> > Premiere Conference on Open Tools Development and Integration.
> > See the breadth of Eclipse activity. February 3-5 in Anaheim, CA.
> > http://www.eclipsecon.org/osdn
> > _______________________________________________
> > Ssic-linux-users mailing list
> > Ssic-linux-users@...
> > https://lists.sourceforge.net/lists/listinfo/ssic-linux-users
From: Walker, Bruce J <bruce.walker@hp...> - 2004-01-26 17:16:43

Michael, Sorry I missed your real question. Currently the shared root must be physically shared. One of the "ongoing" projects is to use DRBD as a way to software-replicate filesystems between non-shared disks on different nodes. The DRBD team was interested in helping when they got the release they were working on out. By now that should be done and we should re-engage them.

Bruce

> -----Original Message-----
> From: ssic-linux-users-admin@...
> [mailto:ssic-linux-users-admin@...] On Behalf Of Vera, Michael
> Sent: Monday, January 26, 2004 8:20 AM
> To: ssic-linux-users@...
> Subject: RE: [SSI-users] HP DL360 Cluster - Shared Root
>
> I guess my question is, can the drives on the secondary node become the
> root "failover" in case the primary node crashes? Or do the nodes *have*
> to use an external disk array to share root?
>
> Thank you,
> Michael Vera
>
> -----Original Message-----
> From: Walker, Bruce J [mailto:bruce.walker@...]
> Sent: Sunday, January 25, 2004 11:17 AM
> To: Vera, Michael; ssic-linux-users@...
> Subject: RE: [SSI-users] HP DL360 Cluster - Shared Root
>
> You needn't "waste" the space on your secondary nodes. Any node can
> "contribute" filesystems to the cluster and be transparently and
> coherently shared by all users/processes on all nodes. Obviously if a
> given node is lost, data stored on that node is unavailable.
>
> Bruce
>
> > -----Original Message-----
> > From: ssic-linux-users-admin@...
> > [mailto:ssic-linux-users-admin@...] On Behalf Of Vera, Michael
> > Sent: Friday, January 23, 2004 11:10 AM
> > To: ssic-linux-users@...
> > Subject: [SSI-users] HP DL360 Cluster - Shared Root
> >
> > Has anyone set up SSI on an HP DL360 with shared root? The only
> > shortcoming I see to this clustering strategy is wasting my 72GB RAID
> > 1 setup on the secondary nodes.
> >
> > Please let me know.
> >
> > Thank you,
> > Michael Vera
> > mxvera@...
From: Anne, Didier <didier.anne@hp...> - 2004-01-26 17:03:44

Hi, I installed OpenSSI 1.0 on a DL360 with a Red Hat 9.0 distribution and I added a node. When I started this added node, the kernel is loaded and then I have a panic:

...
Ipcname_read completed
Instruction(i) breakpoint #0 at 0x0124c70 (adjusted)
0xc0124c70 panic int3
Entering kdb (current=0xf709a000, pid 131151) on process 0 due to breakpoint @ 0xc0124c70
Kdb>

Thanks for your help
Didier
From: Vera, Michael <mxvera@qw...> - 2004-01-26 16:20:41

I guess my question is, can the drives on the secondary node become the root "failover" in case the primary node crashes? Or do the nodes *have* to use an external disk array to share root?

Thank you,
Michael Vera

-----Original Message-----
From: Walker, Bruce J [mailto:bruce.walker@...]
Sent: Sunday, January 25, 2004 11:17 AM
To: Vera, Michael; ssic-linux-users@...
Subject: RE: [SSI-users] HP DL360 Cluster - Shared Root

You needn't "waste" the space on your secondary nodes. Any node can "contribute" filesystems to the cluster and be transparently and coherently shared by all users/processes on all nodes. Obviously if a given node is lost, data stored on that node is unavailable.

Bruce

> -----Original Message-----
> From: ssic-linux-users-admin@...
> [mailto:ssic-linux-users-admin@...] On Behalf Of Vera, Michael
> Sent: Friday, January 23, 2004 11:10 AM
> To: ssic-linux-users@...
> Subject: [SSI-users] HP DL360 Cluster - Shared Root
>
> Has anyone set up SSI on an HP DL360 with shared root? The only
> shortcoming I see to this clustering strategy is wasting my 72GB RAID
> 1 setup on the secondary nodes.
>
> Please let me know.
>
> Thank you,
> Michael Vera
> mxvera@...
From: Walker, Bruce J <bruce.walker@hp...> - 2004-01-26 14:08:14

George, I would like to take the debugging of this off the main list and report our results back to the list when we have some, if that is OK with you. If others would like to be included, I have no problem; they should let you or me know. Please include myself and Laura.

Going forward, I would like you to try a couple of things and provide some output. Specifically, for a "bad" and a "good" run, provide the complete monitoring output you provided before, plus a single cat of /proc/cluster/loadlog from each node, done at the end of the run. Also, is it possible for you to change the launching of the processes to put a sleep(1) between each of the forks? This isn't a long-range solution, but I am interested in what effect it will have (my hope is that exec load leveling will do a perfect job and no process migration will be necessary).

Laura will look into adding the timestamps that you suggested and also changing the semantics of reading the log so that you can do a tail -f of it to get the history as it happens. We may also want to adjust the frequency at which the load-level algorithm runs (currently once a second).

Thanks,
bruce

> -----Original Message-----
> From: Mason George S JR NPRI [mailto:MasonGS@...]
> Sent: Thursday, January 22, 2004 4:35 AM
> To: Walker, Bruce J
> Cc: 'ssic-linux-users'
> Subject: RE: [SSI-users] Problem Migrating Matlab Processes Sometimes?
>
> All 16 processes are started at once. The parent processes do not spawn
> any children. The program opens a file, writes a time stamp, enters a
> loop to do some FFT calculations, writes a time stamp, and then closes
> the file.
>
> I have noticed that during the bad runs the system tends to lock up for
> minutes at a time. I'm not able to control the mouse, keyboard, etc.
> During the good runs, the system is fully responsive.
>
> -----Original Message-----
> From: Walker, Bruce J [mailto:bruce.walker@...]
> Sent: Saturday, January 17, 2004 2:08 AM
> To: Mason George S JR NPRI
> Cc: lramirez@...
> Subject: RE: [SSI-users] Problem Migrating Matlab Processes Sometimes?
>
> George,
> Didn't cc the whole list.
> I notice that in the good run there were 3 execll's at the beginning.
> This may be important. Can you describe how matlab runs? In particular,
> are all the processes started effectively all at once? Is there any
> hierarchy in the processes (parent processes that spawn a set of
> children)? Anything you can say about any phases the computation goes
> through, or I/O or networking it might do?
>
> One possibility that might explain some of the behavior is that you
> either get the execll or not depending on whether all the processes are
> started before the loadlevel daemon gets to run and notices the load
> going up. If you get some exec-time load leveling, then those processes
> never get a chance to load down the startup node and/or run it out of
> memory; maybe they even do some forking on the other nodes once they get
> there to help the distribution even more.
>
> The "bad" runs seem to show a very low "mem" number.
>
> Laura has been real busy helping with the 1.0 release we are doing
> today. Next week she may be able to help. The idea of allowing you to
> do a tail -f /proc/cluster/loadlog has been bounced around, but the
> support for it didn't make the first cut. Timestamps shouldn't be too
> hard. I'm not sure what the overlap means. Laura can hopefully answer
> that.
>
> In 1.0 we have increased the frequency of the node monitoring, which
> will hopefully prevent nodes from inadvertently leaving the cluster and
> has the effect that the loads of other nodes are going to be more
> "current". It may make sense to increase the frequency of making
> load-level decisions as well (and maybe make that tunable?).
>
> We are very happy that you are willing to help us make this better.
>
> Thanks,
> bruce
>
> > -----Original Message-----
> > From: ssic-linux-users-admin@...
> > [mailto:ssic-linux-users-admin@...] On Behalf Of Mason George S JR NPRI
> > Sent: Friday, January 16, 2004 7:28 AM
> > To: Walker, Bruce J
> > Cc: 'ssic-linux-users'; 'lramirez@...'
> > Subject: RE: [SSI-users] Problem Migrating Matlab Processes Sometimes?
> >
> > The /proc/cluster/loadlog does exist in 0.9.96!
> > I executed "cat loadlog | grep -a matlab" at approximately 35-second
> > intervals. Here is an excerpt of what I observed during a good run:
> >
> > execll:pid 113598(matlab) -> node 2 mem 354948 my load 167 node2 load 34
> > execll:pid 113599(matlab) -> node 3 mem 346312 my load 191 node3 load 14
> > execll:pid 113600(matlab) -> node 4 mem 361264 my load 181 node4 load 12
> > loadbl:pid 112144(matlab) -> node 3 mem 343104 my load 178 node3 load 25
> > loadbl:pid 112176(matlab) -> node 3 mem 279788 my load 178 node3 load 36
> > loadbl:pid 112160(matlab) -> node 4 mem 247292 my load 177 node4 load 23
> > loadbl:pid 112155(matlab) -> node 4 mem 157756 my load 176 node4 load 34
> > loadbl:pid 112130(matlab) -> node 4 mem 66336 my load 168 node4 load 35
> > loadbl:pid 112261(matlab) -> node 2 mem 154720 my load 150 node2 load 72
> > loadbl:pid 112239(matlab) -> node 2 mem 167308 my load 132 node2 load 77
> > loadbl:pid 112216(matlab) -> node 4 mem 136724 my load 132 node4 load 73
> > loadbl:pid 112197(matlab) -> node 3 mem 262196 my load 113 node3 load 74
> > loadlb:pid 112249(matlab) <- node 2 mem 309340 my load 49 node2 load 79
> >
> > Here is an excerpt of what I observed during a bad run:
> >
> > loadbl:pid 116531(matlab) -> node 2 mem 7196 my load 108 node2 load 72
> > loadbl:pid 116653(matlab) -> node 2 mem 7356 my load 103 node2 load 33
> > loadbl:pid 116600(matlab) -> node 2 mem 31480 my load 98 node2 load 44
> > loadbl:pid 116631(matlab) -> node 2 mem 79188 my load 110 node2 load 51
> > loadbl:pid 116620(matlab) -> node 3 mem 7392 my load 114 node3 load 23
> >
> > [cusers@... cluster]$ cat loadlog | grep -a matlab
> > loadbl:pid 116531(matlab) -> node 2 mem 7196 my load 108 node2 load 72
> > loadbl:pid 116653(matlab) -> node 2 mem 7356 my load 103 node2 load 33
> > loadbl:pid 116600(matlab) -> node 2 mem 31480 my load 98 node2 load 44
> > loadbl:pid 116631(matlab) -> node 2 mem 79188 my load 110 node2 load 51
> > loadbl:pid 116620(matlab) -> node 3 mem 7392 my load 114 node3 load 23
> >
> > [cusers@... cluster]$ cat loadlog | grep -a matlab
> > loadbl:pid 116531(matlab) -> node 2 mem 7196 my load 108 node2 load 72
> > loadbl:pid 116653(matlab) -> node 2 mem 7356 my load 103 node2 load 33
> > loadbl:pid 116600(matlab) -> node 2 mem 31480 my load 98 node2 load 44
> > loadbl:pid 116631(matlab) -> node 2 mem 79188 my load 110 node2 load 51
> > loadbl:pid 116620(matlab) -> node 3 mem 7392 my load 114 node3 load 23
> >
> > I didn't notice any migrate failures within /proc/cluster/loadlog or
> > /var/log/messages.
> >
> > I did notice that some migrate entries tend to stomp on other migrate
> > entries. Should I be concerned with this?
> >
> > Do you have a recommended way to capture all of the loadlog info?
> > Also, would it be possible to insert timestamps into the loadlog?
> >
> > I'm willing to try any recommended test that you may think of!
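[Archive note: Bruce's sleep(1) suggestion above amounts to staggering the fork/exec of the 16 workers so the exec-time load leveler can see the load rise after each start. A minimal shell sketch of that idea follows; the function name and the example matlab invocation are hypothetical, and actual placement is still left to the cluster's load leveler.]

```shell
# Launch N copies of a job with a pause between forks, so the
# exec-time load leveler can observe the rising load after each start.
staggered_launch() {
  local count=$1 delay=$2
  shift 2
  local pids=""
  for i in $(seq 1 "$count"); do
    "$@" &                 # start one worker in the background
    pids="$pids $!"
    sleep "$delay"         # give the load-level daemon time to react
  done
  wait $pids               # block until every worker finishes
}

# Hypothetical example, roughly matching George's 16-process run:
# staggered_launch 16 1 matlab -nodisplay -r run_fft
```

A delay of 1 second per fork adds only ~15 seconds to a 16-process launch, which is cheap if it avoids memory-pressure-driven migration later in the run.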
From: Jiann-Ming Su <jsu2@em...> - 2004-01-26 07:09:20

On Tue, 20 Jan 2004, David B. Zafman wrote:
> Jiann-Ming Su wrote:
> > Okay, I compiled the RH kernel from the 0.9.96 release on my Debian
> > system. I've downloaded Aneesh's new deb packages. But I get the
> > following boot error:
> >
> > do_ssisys: Illegal op 44 in state 0
> > Usage: init 0123456SsQqAaBbCcUu
> >
> > The system doesn't seem to freeze, but it doesn't continue any
> > farther. I'm able to reboot it with Ctrl-Alt-Del.
>
> I wonder if you are building a proper RAMDISK for Debian. We modified
> the mkinitrd in Red Hat to add SSI-specific actions during boot. The
> "do_ssisys" by itself may not be a problem, but it does indicate that at
> that point the SSI initialization system calls have not been executed
> (state 0). The "Usage" from init is also very weird. Hopefully, Aneesh
> has some insights.

Okay, I was able to run the ssi-create script successfully. For future reference, you want to make sure /var/log isn't a mounted filesystem, because the script wants to move the existing log directory. I made the initrd image as follows:

mkinitrd -d /etc/mkinitrd-openssi -r /dev/sda4 -o /tmp/initrd-2.4.20-ssi.img /lib/modules/2.4.20-pre3/

It kindly gzipped it for me, and I added it to my boot directory. My grub entry looks like:

title OpenSSI
  root (hd0,0)
  kernel /boot/vmlinuz-2.4.20-openssi root=/dev/sda4 init=/linuxrc
  initrd /boot/initrd-2.4.20-ssi.img.gz

However, I still run into the same problem:

Trying to move old root to /initrd ... okay
Freeing unused kernel memory: 112k freed
do_ssisys: Illegal op 44 in state 0
Usage: init 0123456SsQqAaBbCcUu

The system doesn't lock up but just kind of hangs there. I can reboot with Ctrl-Alt-Del. I'm not sure if this matters, but I'm booting with grub from a CD-ROM.

--
Jiann-Ming Su  jsu2@...  404-712-2603
Development Team Systems Administrator
General Libraries Systems Division