From: Joshua J. E. <jj...@sa...> - 2002-10-29 23:06:46
OK, the problem is definitely with the kernel image. The slave nodes complain:

'Loading 10.0.4.100:/bproc/vmlinuz-beoboot error: not a valid image'

This image was created with 'beoboot -2 -n -o vmlinuz-beoboot' from a
bproc 2.4.19 kernel. What could be wrong?

Here is the slave output:

...
...
bus 00, function 00, vendor 8086, device 7100
bus 00, function 38, vendor 8086, device 7110
bus 00, function 39, vendor 8086, device 7111
bus 00, function 3A, vendor 8086, device 7112
bus 00, function 3B, vendor 8086, device 7113
bus 00, function 90, vendor 8086, device 1209
FOUND at bus 0x00000000, devfn 0x00000090
at reg 0x00000010 ioaddr is 0x80000000
at reg 0x00000014 ioaddr is 0x00001041
After mask op ioaddr is 0x00001040
Found Intel EtherExpressPro100 82559ER at 0X1040, ROM address 0X0000
Probing...[EEPRO100]Checking to see if BIOS properly set the 82557 to be the bus master in eepro100_probe
Checking if PCI latency timer is correct in eepro100_probe
Ethernet addr: 00:30:59:00:98:26
Searching for server (DHCP)...
Sending packets in bootp
Before entering await_reply...
After await_reply, before udp_transmit in bootp
Before entering eth_transmit in udp_transmit
Before entering eth_transmit in udp_transmit
After load_configuration in main
Entering load
Me: 10.0.4.10, Server: 10.0.4.100
Before loading kernel in load
Loading 10.0.4.100:/bproc/vmlinuz-beoboot error: not a valid image
Unable to load file.
<sleep>
<abort>
bus 00, function 00, vendor 8086, device 7100
bus 00, function 38, vendor 8086, device 7110
bus 00, function 39, vendor 8086, device 7111
bus 00, function 3A, vendor 8086, device 7112
bus 00, function 3B, vendor 8086, device 7113
bus 00, function 90, vendor 8086, device 1209
FOUND at bus 0x00000000, devfn 0x00000090
at reg 0x00000010 ioaddr is 0x80000000
at reg 0x00000014 ioaddr is 0x00001041
After mask op ioaddr is 0x00001040
Found Intel EtherExpressPro100 82559ER at 0X1040, ROM address 0X0000
Probing...[EEPRO100]Checking to see if BIOS properly set the 82557 to be the bus master in eepro100_probe
Checking if PCI latency timer is correct in eepro100_probe
Ethernet addr: 00:30:59:00:98:26
Searching for server (DHCP)...
Sending packets in bootp
Before entering await_reply...
After await_reply, before udp_transmit in bootp
Before entering eth_transmit in udp_transmit
Before entering eth_transmit in udp_transmit
After load_configuration in main
Entering load
Me: 10.0.4.10, Server: 10.0.4.100
Before loading kernel in load
Loading 10.0.4.100:/bproc/vmlinuz-beoboot error: not a valid image
Unable to load file.
<sleep>
<abort>
...
...

-JE
-----------------------------------------------
Josh England
Sandia National Laboratory, Livermore, CA
Distributed Information Systems
email: jj...@sa...
phone: (925) 294-2076

On Mon, 2002-10-28 at 13:28, Joshua J. England wrote:
> Hello,
>
> //** THE SETUP **
> I've got a test cluster that works using bproc-3.1.9 (RH7.2 master node)
> from the March ClusterMatic CD. I'm trying to build a new master node
> (RH8.0) from source using bproc-3.2.2 with beoboot-lanl.1.3. Beowulf
> starts up clean.
>
> Nodes all boot with linuxbios, so I don't need to muck with a phase 1
> kernel.
>
> The phase 2 kernel was built with:
> 'beoboot -2 -n -o vmlinuz-beoboot'.
>
>
> //** THE PROBLEM **
> When a slave boots, it gets stuck in an infinite loop like this:
> while (1) {
>     // slave issues DHCP request
>     // slave does arp for master -- master responds
>     // dhcp serves up the kernel
>     // new in.tftpd process starts up on master
>     // slave starts the tftp download and downloads a few blocks
> }
>
> I end up with tons of tftp daemons all trying to serve a single node,
> and beoserv never receives a RARP.
>
> This seems detached from bproc master problems -- stopping beowulf
> produces the same effect.
>
> So the question is: has anyone seen this before? What is causing the
> slave to continue to issue DHCP requests after the first request
> seemingly succeeds? Everything works fine when using the 3.1.9 master
> node. Is this merely another SUA (Stupid User Artifact) where the
> answer should be blindingly obvious?
>
> Thanks for any help,
>
> -JE
> -----------------------------------------------
> Josh England
> Sandia National Laboratory, Livermore, CA
> Distributed Information Systems
> email: jj...@sa...
> phone: (925) 294-2076
>
> _______________________________________________
> BProc-users mailing list
> BPr...@li...
> https://lists.sourceforge.net/lists/listinfo/bproc-users
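
Since the loader rejects the file as 'not a valid image', one thing that might help narrow this down is checking what container format vmlinuz-beoboot actually has on the master. Below is a minimal Python sketch that only looks at well-known magic bytes (ELF, the Etherboot NBI tagged-image magic 0x1B031336, the x86 Linux boot header with 'HdrS' at offset 0x202, and gzip). The function name classify_image and the returned labels are made up for illustration, and which of these formats the slave's network loader accepts is an assumption to check against its own documentation; the script does not determine that.

```python
import sys

ELF_MAGIC = b"\x7fELF"            # ELF executables and ELF boot images
NBI_MAGIC = b"\x36\x13\x03\x1b"   # 0x1B031336 stored little-endian (NBI tagged image)
GZIP_MAGIC = b"\x1f\x8b"          # raw gzip stream


def classify_image(header: bytes) -> str:
    """Guess the container format from the first bytes of a file.

    Hypothetical helper for illustration: it checks magic numbers only
    and does not validate the rest of the image.
    """
    if header.startswith(ELF_MAGIC):
        return "elf"
    if header.startswith(NBI_MAGIC):
        return "nbi"
    # x86 Linux zImage/bzImage: boot-sector signature 0x55 0xAA at offset 510
    # and the setup-header magic "HdrS" at offset 0x202.
    if (len(header) >= 0x206
            and header[510:512] == b"\x55\xaa"
            and header[0x202:0x206] == b"HdrS"):
        return "linux-boot-image"
    if header.startswith(GZIP_MAGIC):
        return "gzip"
    return "unknown"


if __name__ == "__main__":
    # Usage: python classify_image.py /path/to/vmlinuz-beoboot
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            print(path, classify_image(f.read(1024)))
```

If the result is a format the loader on the slave was not built to handle, that would be consistent with the error above, though it would not by itself explain why the same procedure works from the 3.1.9 master.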