From: <ha...@no...> - 2002-10-30 16:17:46
|
Is it hard to reconcile bproc with highmem?

I found J.A. Magallon's patch collection 2.4.20-pre10-jam1 and wanted to use it on an Abit-IT7-MAX2 based cluster (the patches contain both HPT374 support and the bproc kernel patch, which made me happy). I made a few replacements in bproc (using J.A.M.'s hints):

nice = current->nice   ----->  nice = task_nice(current)
current->nice = nice   ----->  set_user_nice(current,nice)
DEF_NICE               ----->  (0)

and bproc-3.2.2 compiled OK. However, depmod complains about an unresolved symbol kmap_pagetable in vmadump.o. This is caused by the new highmem code (support for 1G and more memory), and there has already been a lot of buzz about the many drivers broken by highmem. There seem to be quick 'fixes' like replacing #include <asm/pgtable.h> with #include <linux/highmem.h>, but there is more to it:

Andrea Arcangeli, May 4 2002:
> You should #include <linux/highmem.h> in those drivers .c files, then it
> will compile, but that's not the right fix, you'd need to add the
> pte_kunmap too or it would deadlock with highmem. The right fix is to
> convert those drivers to vmalloc_to_page, then they will work flawlessly.
> Alan actually has a patch in his -ac that converted most usb and other
> drivers to vmalloc_to_page, I will merge it plus I will convert those
> below drivers if they're not just covered by Alan's patch. Alan could
> you push it to Marcelo?

So, my question is: what does this mean for bproc? Adding <linux/highmem.h> somewhere, a (complicated?) conversion to vmalloc_to_page, or something more? 1G or more memory is certainly reasonable on clusters these days; what should be done to get bproc working with it? Or did I miss something, and bproc can work with 1G and up without highmem?

Best Regards
Vaclav |
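For anyone repeating the three source replacements listed in the message above, here is a minimal Python sketch that applies them to C source text. The function name `port_nice_api` and the regular expressions are illustrative assumptions, not part of bproc or of the -jam patch set; the sketch only rewrites text and says nothing about the separate kmap_pagetable problem.

```python
import re

# The three replacements quoted in the message, expressed as regex rules.
# Order matters: assignments to current->nice must be handled before reads.
SUBSTITUTIONS = [
    # current->nice = X;   ->   set_user_nice(current, X);
    # (?!=) keeps comparisons such as "current->nice == 0" out of this rule.
    (re.compile(r"current->nice\s*=(?!=)\s*([^;]+);"),
     r"set_user_nice(current, \1);"),
    # any remaining read of current->nice   ->   task_nice(current)
    (re.compile(r"current->nice\b"), "task_nice(current)"),
    # DEF_NICE   ->   (0)
    (re.compile(r"\bDEF_NICE\b"), "(0)"),
]


def port_nice_api(source: str) -> str:
    """Return `source` with the three replacements applied in order."""
    for pattern, replacement in SUBSTITUTIONS:
        source = pattern.sub(replacement, source)
    return source


if __name__ == "__main__":
    sample = "nice = current->nice;\ncurrent->nice = nice;\n"
    print(port_nice_api(sample))
```

Reviewing the resulting diff by hand is still advisable, since a regex cannot tell whether an expression is used in a context where the new helpers behave differently.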
From: Joshua J. E. <jj...@sa...> - 2002-10-29 23:06:46
|
OK, the problem is definitely with the kernel image. The slave nodes complain: 'Loading 10.0.4.100:/bproc/vmlinuz-beoboot error: not a valid image' This image was created with 'beoboot -2 -n -o vmlinuz-beoboot' from a bproc 2.4.19 kernel. What could be wrong? Here is the slave output: ... ... bus 00, function 00, vendor 8086, device 7100 bus 00, function 38, vendor 8086, device 7110 bus 00, function 39, vendor 8086, device 7111 bus 00, function 3A, vendor 8086, device 7112 bus 00, function 3B, vendor 8086, device 7113 bus 00, function 90, vendor 8086, device 1209 FOUND at bus 0x00000000, devfn 0x00000090 at reg 0x00000010 ioaddr is 0x80000000 at reg 0x00000014 ioaddr is 0x00001041 After mask op ioaddr is 0x00001040 Found Intel EtherExpressPro100 82559ER at 0X1040, ROM address 0X0000 Probing...[EEPRO100]Checking to see if BIOS properly set the 82557 to be the bus master in eepro100_probe Checking if PCI latency timer is correct in eepro100_probe Ethernet addr: 00:30:59:00:98:26 Searching for server (DHCP)... Sending packets in bootp Before entering await_reply... After await_reply, before udp_transmit in bootp Before entering eth_transmit in udp_transmit Before entering eth_transmit in udp_transmit After load_configuration in main Entering load Me: 10.0.4.10, Server: 10.0.4.100 Before loading kernel in load Loading 10.0.4.100:/bproc/vmlinuz-beoboot error: not a valid image Unable to load file. 
<sleep> <abort> bus 00, function 00, vendor 8086, device 7100 bus 00, function 38, vendor 8086, device 7110 bus 00, function 39, vendor 8086, device 7111 bus 00, function 3A, vendor 8086, device 7112 bus 00, function 3B, vendor 8086, device 7113 bus 00, function 90, vendor 8086, device 1209 FOUND at bus 0x00000000, devfn 0x00000090 at reg 0x00000010 ioaddr is 0x80000000 at reg 0x00000014 ioaddr is 0x00001041 After mask op ioaddr is 0x00001040 Found Intel EtherExpressPro100 82559ER at 0X1040, ROM address 0X0000 Probing...[EEPRO100]Checking to see if BIOS properly set the 82557 to be the bus master in eepro100_probe Checking if PCI latency timer is correct in eepro100_probe Ethernet addr: 00:30:59:00:98:26 Searching for server (DHCP)... Sending packets in bootp Before entering await_reply... After await_reply, before udp_transmit in bootp Before entering eth_transmit in udp_transmit Before entering eth_transmit in udp_transmit After load_configuration in main Entering load Me: 10.0.4.10, Server: 10.0.4.100 Before loading kernel in load Loading 10.0.4.100:/bproc/vmlinuz-beoboot error: not a valid image Unable to load file. <sleep> <abort> ... ... -JE ----------------------------------------------- Josh England Sandia National Laboratory, Livermore, CA Distributed Information Systems email: jj...@sa... phone: (925) 294-2076 On Mon, 2002-10-28 at 13:28, Joshua J. England wrote: > Hello, > > //** THE SETUP ** > I've got a test cluster that works using bproc-3.1.9 (RH7.2 master node) > from the March ClusterMatic CD. I'm trying to build a new master node > (RH8.0) from source using bproc-3.2.2 with beoboot-lanl.1.3. Beowulf > starts up clean. > > Nodes all boot with linuxbios, so I don't need to muck with a phase 1 > kernel. > > The phase 2 kernel was built with: > 'beoboot -2 -n -o vmlinuz-beoboot'. 
>
>
> //** THE PROBLEM **
> When a slave boots, it gets stuck in an infinite loop like this:
> while (1) {
>     // slave issues dhcp request
>     // slave does arp for master -- master responds
>     // dhcp serves up the kernel
>     // new in.tftpd process starts up on master
>     // slave starts the tftp download and downloads a few blocks
> }
>
> I end up with tons of tftp daemons all trying to serve a single node,
> and beoserv never receives a RARP.
>
> This seems detached from bproc master problems --stopping beowulf
> produces the same effect.
>
> So the question is: has anyone seen this before? What is causing the
> slave to continue to issue DHCP requests after the first request
> seemingly succeeds? Everything works fine when using the 3.1.9 master
> node. Is this merely another SUA (Stupid User Artifact) where the
> answer should be blindingly obvious?
>
> Thanks for any help,
>
> -JE
> -----------------------------------------------
> Josh England
> Sandia National Laboratory, Livermore, CA
> Distributed Information Systems
> email: jj...@sa...
> phone: (925) 294-2076
>
> _______________________________________________
> BProc-users mailing list
> BPr...@li...
> https://lists.sourceforge.net/lists/listinfo/bproc-users |
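When a network loader rejects a file as "not a valid image", one quick sanity check is to look at the file's leading bytes on the master. The sketch below is an assumption-laden aid rather than a diagnosis of this thread: it recognises three common formats by their well-known signatures (ELF magic, the Etherboot tagged-image magic 0x1B031336, and the Linux x86 boot header with 0xAA55 at offset 0x1FE and "HdrS" at offset 0x202). Which of these formats the loader on these nodes actually accepts is not stated in the thread, and the function name `identify_boot_image` is hypothetical.

```python
import struct


def identify_boot_image(data: bytes) -> str:
    """Classify a boot image by its leading bytes.

    Returns one of "elf", "nbi", "linux-bzimage" or "unknown".
    """
    # ELF files start with 0x7f 'E' 'L' 'F'.
    if data[:4] == b"\x7fELF":
        return "elf"
    # Etherboot/netboot tagged images start with a little-endian magic word.
    if len(data) >= 4 and struct.unpack_from("<I", data, 0)[0] == 0x1B031336:
        return "nbi"
    # Linux x86 kernels carry a boot-sector flag and a "HdrS" setup header.
    if (len(data) >= 0x206
            and data[0x1FE:0x200] == b"\x55\xaa"
            and data[0x202:0x206] == b"HdrS"):
        return "linux-bzimage"
    return "unknown"


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as fh:
        print(identify_boot_image(fh.read(0x400)))
```

Run it against the file the node downloads (here that would be the vmlinuz-beoboot file); a result the loader does not understand would be consistent with the error in the log, though it does not by itself identify the cause.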
From: Luiz O. de L. R. <ll...@st...> - 2002-10-29 18:19:56
|
Wilton,

Your help with the PCI IDs was essential, and now both the phase 1 and the phase 2 boot are working. However, I have another problem: the phase 1 boot cannot load the file from the master node. It looks for the file at /var/beowulf/boot.img, and although the file is there, it keeps reporting an error and cannot load the image. To work around this, I am booting the slave nodes directly with the phase 2 image.

Another question concerns access. Many applications use rsh to reach the slave nodes, but these boot images do not start that service. How should I proceed? My main concern is LAM-MPI: how do I make it work with these Beowulf slave nodes?

A third question is about GLIBC. I compiled and installed Beoboot with an unmodified GLIBC, because I read the following in the ReleaseNotes file:

"Beoboot no longer requires a modified C library. The dynamic linker patch was only for demand loading of libraries which we're trying to get away from."

Do I really have to recompile a modified GLIBC, or can I use the original?

Thanks a lot!
Luiz Otavio

(Translated from the Brazilian Portuguese original.)

-----Original Message-----
From: bpr...@li... [mailto:bpr...@li...] On behalf of Wilton Wong
Sent: Friday, October 25, 2002 03:45
To: Luiz Otavio de Lima Rodrigues
Cc: 'bpr...@li...'
Subject: Re: RES: [BProc] Beoboot troubles

On Thu, 24 Oct 2002, Luiz Otavio de Lima Rodrigues wrote:
> However, I did not find a line as
>
> pci 0x10ec 0x8129 8139too
>
> Where I obtain information on these configurations?

These IDs can be extracted from a running system using the "lspci" command; beoboot will also list the PCI IDs on the console if it fails to load a module for the network. Default driver information can be extracted from /usr/share/hwdata/pcitable if you have the hwdata rpm installed. Be aware that the driver listed in this file may not always be the best driver for your particular version of the hardware. For example, there are two different drivers one can use for a sym53c8xx SCSI card: the default "sym53c8xx.o" hasn't worked very well for us, causing all sorts of problems with DMA, but if we use the "sym53c8xx_2.o" driver everything works flawlessly.

Also, a current list of PCI IDs can be obtained from: http://pciids.sourceforge.net/

- Wilton

----[ Wilton William Wong ]---------------------------------------------
11060-166 Avenue    Ph : 01-780-456-9771   High Performance UNIX
Edmonton, Alberta   FAX: 01-780-456-9772   and Linux Solutions
T5X 1Y3, Canada     URL: http://www.harddata.com
-------------------------------------------------------[ Hard Data Ltd. ]---- |
From: Joshua J. E. <jj...@sa...> - 2002-10-28 23:20:07
|
These guys are PC/104 nodes using Advanced Digital Logic smartcore MSMP5SEN/SEV CPUs. I haven't tried testing memory, but I've got three identical nodes and they all behave the same. It's down to the point where I can use one kernel and make it work, or use my new one, which doesn't work. Unfortunately, I don't have the config for the kernel that works. You think there might be some kernel magic to boot these chips? Oh Minnich, heed my call! :)

I'll keep at it.

-JE

Here is my .config:

#
# Automatically generated make config: don't edit
#
CONFIG_X86=y
CONFIG_ISA=y
CONFIG_UID16=y
#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_KMOD=y
#
# Processor type and features
#
CONFIG_MPENTIUMIII=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_CMPXCHG=y
CONFIG_X86_XADD=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_X86_L1_CACHE_SHIFT=5
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_PGE=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_MCE=y
CONFIG_HIGHMEM4G=y
CONFIG_HIGHMEM=y
CONFIG_SMP=y
CONFIG_HAVE_DEC_LOCK=y
#
# General setup
#
CONFIG_NET=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_PCI=y
CONFIG_PCI_GOANY=y
CONFIG_PCI_BIOS=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_NAMES=y
CONFIG_HOTPLUG=y
#
# PCMCIA/CardBus support
#
CONFIG_PCMCIA=y
CONFIG_CARDBUS=y
#
# PCI Hotplug Support
#
CONFIG_SYSVIPC=y
CONFIG_BPROC=y
CONFIG_SYSCTL=y
CONFIG_KCORE_ELF=y
CONFIG_BINFMT_AOUT=y
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_MISC=y
CONFIG_PM=y
CONFIG_SOFTWARE_SUSPEND=y
CONFIG_ACPI=y
CONFIG_ACPI_BUSMGR=y
CONFIG_ACPI_SYS=y
CONFIG_ACPI_CPU=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_AC=y
CONFIG_ACPI_EC=y
CONFIG_ACPI_CMBATT=y
CONFIG_ACPI_THERMAL=y
CONFIG_APM=y
CONFIG_APM_DO_ENABLE=y
CONFIG_APM_CPU_IDLE=y
CONFIG_APM_RTC_IS_GMT=y
CONFIG_APM_REAL_MODE_POWER_OFF=y
#
# Plug and Play configuration
#
CONFIG_PNP=y
CONFIG_ISAPNP=y
#
# Block devices
#
CONFIG_BLK_DEV_FD=y
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_NBD=m
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_SIZE=8192
CONFIG_BLK_DEV_INITRD=y
#
# Networking options
#
CONFIG_PACKET=y
CONFIG_FILTER=y
CONFIG_UNIX=y
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
CONFIG_IP_PNP_RARP=y
#
# ATA/IDE/MFM/RLL support
#
CONFIG_IDE=y
#
# IDE, ATA and ATAPI Block devices
#
CONFIG_BLK_DEV_IDE=y
#
# Please see Documentation/ide.txt for help/info on IDE drives
#
CONFIG_BLK_DEV_IDEDISK=y
CONFIG_IDEDISK_MULTI_MODE=y
CONFIG_BLK_DEV_IDECS=m
CONFIG_BLK_DEV_IDECD=y
CONFIG_BLK_DEV_IDEFLOPPY=m
CONFIG_BLK_DEV_IDESCSI=m
#
# IDE chipset support/bugfixes
#
CONFIG_BLK_DEV_CMD640=y
CONFIG_BLK_DEV_RZ1000=y
CONFIG_BLK_DEV_IDEPCI=y
CONFIG_IDEPCI_SHARE_IRQ=y
CONFIG_BLK_DEV_IDEDMA_PCI=y
CONFIG_IDEDMA_PCI_AUTO=y
CONFIG_BLK_DEV_IDEDMA=y
CONFIG_BLK_DEV_ADMA=y
CONFIG_BLK_DEV_PIIX=y
CONFIG_PIIX_TUNING=y
CONFIG_IDEDMA_AUTO=y
CONFIG_BLK_DEV_IDE_MODES=y
#
# SCSI support
#
CONFIG_SCSI=y
#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
CONFIG_SD_EXTRA_DEVS=40
CONFIG_BLK_DEV_SR=m
CONFIG_SR_EXTRA_DEVS=2
CONFIG_CHR_DEV_SG=m
#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
CONFIG_SCSI_DEBUG_QUEUES=y
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
#
# SCSI low-level drivers
#
CONFIG_SCSI_SYM53C8XX=y
CONFIG_SCSI_NCR53C8XX_DEFAULT_TAGS=4
CONFIG_SCSI_NCR53C8XX_MAX_TAGS=32
CONFIG_SCSI_NCR53C8XX_SYNC=20
#
# Network device support
#
CONFIG_NETDEVICES=y
#
# ARCnet devices
#
CONFIG_DUMMY=m
#
# Ethernet (10 or 100Mbit)
#
CONFIG_NET_ETHERNET=y
CONFIG_NET_PCI=y
CONFIG_EEPRO100=m
#
# Ethernet (1000 Mbit)
#
CONFIG_PPP=m
#
# Wireless LAN (non-hamradio)
#
CONFIG_NET_RADIO=y
CONFIG_HERMES=m
#
# Wireless Pcmcia cards support
#
CONFIG_PCMCIA_HERMES=m
CONFIG_NET_WIRELESS=y
#
# PCMCIA network device support
#
CONFIG_NET_PCMCIA=y
CONFIG_PCMCIA_3C589=m
CONFIG_PCMCIA_3C574=m
CONFIG_PCMCIA_FMVJ18X=m
CONFIG_PCMCIA_PCNET=m
CONFIG_PCMCIA_AXNET=m
CONFIG_PCMCIA_NMCLAN=m
CONFIG_PCMCIA_SMC91C92=m
CONFIG_PCMCIA_XIRC2PS=m
CONFIG_PCMCIA_XIRCOM=m
CONFIG_PCMCIA_XIRTULIP=m
CONFIG_NET_PCMCIA_RADIO=y
CONFIG_PCMCIA_RAYCS=m
CONFIG_PCMCIA_NETWAVE=m
CONFIG_PCMCIA_WAVELAN=m
#
# Input core support
#
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
#
# Character devices
#
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_SERIAL=y
CONFIG_SERIAL_CONSOLE=y
CONFIG_UNIX98_PTYS=y
CONFIG_UNIX98_PTY_COUNT=256
#
# Mice
#
CONFIG_MOUSE=y
CONFIG_PSMOUSE=y
#
# Ftape, the floppy tape device driver
#
CONFIG_AGP=y
CONFIG_AGP_INTEL=y
CONFIG_AGP_I810=y
CONFIG_AGP_VIA=y
CONFIG_AGP_AMD=y
CONFIG_AGP_SIS=y
CONFIG_AGP_ALI=y
CONFIG_DRM=y
#
# DRM 4.1 drivers
#
CONFIG_DRM_NEW=y
CONFIG_DRM_TDFX=y
CONFIG_DRM_RADEON=y
#
# File systems
#
CONFIG_AUTOFS4_FS=y
CONFIG_REISERFS_FS=m
CONFIG_EXT3_FS=y
CONFIG_JBD=y
CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_UMSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_CRAMFS=m
CONFIG_TMPFS=y
CONFIG_RAMFS=y
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_MINIX_FS=m
CONFIG_NTFS_FS=m
CONFIG_PROC_FS=y
CONFIG_DEVPTS_FS=y
CONFIG_ROMFS_FS=m
CONFIG_EXT2_FS=y
#
# Network File Systems
#
CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
CONFIG_NFSD=y
CONFIG_NFSD_V3=y
CONFIG_SUNRPC=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_ZLIB_FS_INFLATE=m
#
# Partition Types
#
CONFIG_MSDOS_PARTITION=y
CONFIG_NLS=y
#
# Native Language Support
#
CONFIG_NLS_DEFAULT="iso8859-1"
#
# Console drivers
#
CONFIG_VGA_CONSOLE=y
CONFIG_VIDEO_SELECT=y
#
# Frame-buffer support
#
CONFIG_FB=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_FB_VESA=y
CONFIG_VIDEO_SELECT=y
CONFIG_FBCON_CFB8=y
CONFIG_FBCON_CFB16=y
CONFIG_FBCON_CFB24=y
CONFIG_FBCON_CFB32=y
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
#
# Sound
#
CONFIG_SOUND=y
CONFIG_SOUND_ES1371=m
CONFIG_SOUND_ICH=m
#
# USB support
#
CONFIG_USB=m
#
# USB Host Controller Drivers
#
CONFIG_USB_UHCI_ALT=m
#
# USB Device Class drivers
#
CONFIG_USB_STORAGE=m
#
# Kernel hacking
#
CONFIG_DEBUG_KERNEL=y
CONFIG_MAGIC_SYSRQ=y

-----------------------------------------------
Josh England
Sandia National Laboratory, Livermore, CA
Distributed Information Systems
email: jj...@sa...
phone: (925) 294-2076

On Mon, 2002-10-28 at 12:21, steven james wrote:
> Greetings,
>
> I've seen similar when a network card or driver had problems. It can also
> happen if memory isn't right. Have you tried running memtest86 on it
> (Recent memtest86 build with an elf image suitable for netbooting in
> etherboot).
>
> Which chipset/mainboard?
>
> G'day,
> sjames
>
>
> On 28 Oct 2002, Joshua J. England wrote:
>
> > Hello,
> >
> > //** THE SETUP **
> > I've got a test cluster that works using bproc-3.1.9 (RH7.2 master node)
> > from the March ClusterMatic CD. I'm trying to build a new master node
> > (RH8.0) from source using bproc-3.2.2 with beoboot-lanl.1.3. Beowulf
> > starts up clean.
> >
> > Nodes all boot with linuxbios, so I don't need to muck with a phase 1
> > kernel.
> >
> > The phase 2 kernel was built with:
> > 'beoboot -2 -n -o vmlinuz-beoboot'.
> >
> >
> > //** THE PROBLEM **
> > When a slave boots, it gets stuck in an infinite loop like this:
> > while (1) {
> >     // slave issues dhcp request
> >     // slave does arp for master -- master responds
> >     // dhcp serves up the kernel
> >     // new in.tftpd process starts up on master
> >     // slave starts the tftp download and downloads a few blocks
> > }
> >
> > I end up with tons of tftp daemons all trying to serve a single node,
> > and beoserv never receives a RARP.
> >
> > This seems detached from bproc master problems --stopping beowulf
> > produces the same effect.
> >
> > So the question is: has anyone seen this before? What is causing the
> > slave to continue to issue DHCP requests after the first request
> > seemingly succeeds? Everything works fine when using the 3.1.9 master
> > node. Is this merely another SUA (Stupid User Artifact) where the
> > answer should be blindingly obvious?
> >
> > Thanks for any help,
> >
> > -JE
> > -----------------------------------------------
> > Josh England
> > Sandia National Laboratory, Livermore, CA
> > Distributed Information Systems
> > email: jj...@sa...
> > phone: (925) 294-2076
> >
>
> --
> -------------------------steven james, director of research, linux labs
> ... ........ ..... ....                    230 peachtree st nw ste 701
> the original linux labs                          atlanta.ga.us 30303
> -since 1995                                 http://www.linuxlabs.com
> office 404.577.7747                                 fax 404.577.7743
> -----------------------------------------------------------------------
>
> |
From: steven j. <py...@li...> - 2002-10-28 22:22:16
|
Greetings,

I've seen similar when a network card or driver had problems. It can also happen if memory isn't right. Have you tried running memtest86 on it (Recent memtest86 build with an elf image suitable for netbooting in etherboot).

Which chipset/mainboard?

G'day,
sjames

On 28 Oct 2002, Joshua J. England wrote:
> Hello,
>
> //** THE SETUP **
> I've got a test cluster that works using bproc-3.1.9 (RH7.2 master node)
> from the March ClusterMatic CD. I'm trying to build a new master node
> (RH8.0) from source using bproc-3.2.2 with beoboot-lanl.1.3. Beowulf
> starts up clean.
>
> Nodes all boot with linuxbios, so I don't need to muck with a phase 1
> kernel.
>
> The phase 2 kernel was built with:
> 'beoboot -2 -n -o vmlinuz-beoboot'.
>
>
> //** THE PROBLEM **
> When a slave boots, it gets stuck in an infinite loop like this:
> while (1) {
>     // slave issues dhcp request
>     // slave does arp for master -- master responds
>     // dhcp serves up the kernel
>     // new in.tftpd process starts up on master
>     // slave starts the tftp download and downloads a few blocks
> }
>
> I end up with tons of tftp daemons all trying to serve a single node,
> and beoserv never receives a RARP.
>
> This seems detached from bproc master problems --stopping beowulf
> produces the same effect.
>
> So the question is: has anyone seen this before? What is causing the
> slave to continue to issue DHCP requests after the first request
> seemingly succeeds? Everything works fine when using the 3.1.9 master
> node. Is this merely another SUA (Stupid User Artifact) where the
> answer should be blindingly obvious?
>
> Thanks for any help,
>
> -JE
> -----------------------------------------------
> Josh England
> Sandia National Laboratory, Livermore, CA
> Distributed Information Systems
> email: jj...@sa...
> phone: (925) 294-2076
>

--
-------------------------steven james, director of research, linux labs
... ........ ..... ....                    230 peachtree st nw ste 701
the original linux labs                          atlanta.ga.us 30303
-since 1995                                 http://www.linuxlabs.com
office 404.577.7747                                 fax 404.577.7743
----------------------------------------------------------------------- |
From: Joshua J. E. <jj...@sa...> - 2002-10-28 21:30:56
|
Hello,

//** THE SETUP **
I've got a test cluster that works using bproc-3.1.9 (RH7.2 master node) from the March ClusterMatic CD. I'm trying to build a new master node (RH8.0) from source using bproc-3.2.2 with beoboot-lanl.1.3. Beowulf starts up clean.

Nodes all boot with linuxbios, so I don't need to muck with a phase 1 kernel.

The phase 2 kernel was built with:
'beoboot -2 -n -o vmlinuz-beoboot'.

//** THE PROBLEM **
When a slave boots, it gets stuck in an infinite loop like this:
while (1) {
    // slave issues dhcp request
    // slave does arp for master -- master responds
    // dhcp serves up the kernel
    // new in.tftpd process starts up on master
    // slave starts the tftp download and downloads a few blocks
}

I end up with tons of tftp daemons all trying to serve a single node, and beoserv never receives a RARP.

This seems detached from bproc master problems --stopping beowulf produces the same effect.

So the question is: has anyone seen this before? What is causing the slave to continue to issue DHCP requests after the first request seemingly succeeds? Everything works fine when using the 3.1.9 master node. Is this merely another SUA (Stupid User Artifact) where the answer should be blindingly obvious?

Thanks for any help,

-JE
-----------------------------------------------
Josh England
Sandia National Laboratory, Livermore, CA
Distributed Information Systems
email: jj...@sa...
phone: (925) 294-2076 |
From: Doug <dos...@mc...> - 2002-10-28 19:44:34
|
Is there any documentation (e.g. HOWTOs) on how to set up a cluster to use this? Or is the only documentation what is included with the source?

Thanks,
Doug |
From: MIDN S. J. <m0...@us...> - 2002-10-26 17:57:28
|
I'm trying to compile bproc 3.2.2 on a RH 8.0 system with a custom kernel and a patched glibc. During compilation of hooks.c I receive multiple errors similar to this:

hooks.c:777: `bproc_hook_do_execve_hook' undeclared (first use in this function)

I've looked through the code, and the only occurrences I have found of any functions with similar names are in hooks.c. I don't have access to an older machine, so I have been unable to test whether it is just an issue with the libraries in RH 8.0.

Thank you,
Sean Jones
MIDN USN

--
Sean Jones
MIDN USN
m0...@us...
United States Naval Academy
Annapolis, MD 21412 |
From: steven j. <py...@li...> - 2002-10-25 16:57:46
|
Greetings,

Agreed, PXE has its limits. I usually avoid them by doing a sequenced power up (nicer to the mains as well). I'm considering getting LinuxBIOS/Etherboot to do multicast.

G'day,
sjames

On Fri, 25 Oct 2002, Wilton Wong wrote:
>
> On Fri, 25 Oct 2002, steven james wrote:
>
> > For a test, it might be worth trying either NFS root mount, hda1, or even
> > a floppy as a root filesystem to eliminate the ramdisk as a problem.
>
> I also use PXE here and we only run into problems when we boot up 50+ nodes at
> the same time, the master node can't handle the network stress and PXE is not
> smart enough to handle it ;)
>
> - Wilton
>
> ----[ Wilton William Wong ]---------------------------------------------
> 11060-166 Avenue    Ph : 01-780-456-9771   High Performance UNIX
> Edmonton, Alberta   FAX: 01-780-456-9772   and Linux Solutions
> T5X 1Y3, Canada     URL: http://www.harddata.com
> -------------------------------------------------------[ Hard Data Ltd. ]----
>

--
-------------------------steven james, director of research, linux labs
... ........ ..... ....                    230 peachtree st nw ste 701
the original linux labs                          atlanta.ga.us 30303
-since 1995                                 http://www.linuxlabs.com
office 404.577.7747                                 fax 404.577.7743
----------------------------------------------------------------------- |
From: Wilton W. <ww...@ha...> - 2002-10-25 16:52:37
|
On Fri, 25 Oct 2002, steven james wrote:
> For a test, it might be worth trying either NFS root mount, hda1, or even
> a floppy as a root filesystem to eliminate the ramdisk as a problem.

I also use PXE here, and we only run into problems when we boot up 50+ nodes at the same time: the master node can't handle the network stress, and PXE is not smart enough to handle it ;)

- Wilton

----[ Wilton William Wong ]---------------------------------------------
11060-166 Avenue    Ph : 01-780-456-9771   High Performance UNIX
Edmonton, Alberta   FAX: 01-780-456-9772   and Linux Solutions
T5X 1Y3, Canada     URL: http://www.harddata.com
-------------------------------------------------------[ Hard Data Ltd. ]---- |
From: steven j. <py...@li...> - 2002-10-25 12:53:23
|
Greetings,

I usually use PXE boot on those boards, so my experience isn't directly related, but I have seen similar behaviour when the initrd has a CRC error.

For a test, it might be worth trying either NFS root mount, hda1, or even a floppy as a root filesystem to eliminate the ramdisk as a problem.

G'day,
sjames

On Thu, 24 Oct 2002, Thomas Clausen wrote:
> Hi,
>
> I'm trying to install bproc 3.2.1 and beoboot 1.3 on a cluster of dual CPU
> Tyan 2466 running kernel 2.4.19.
>
> The master comes up fine. The nodes freeze while booting
> /var/beowulf/boot.img
> after
>
> .
> .
> .
> mounting root fs
> freeing unused kernel memory (260K)
>
> There are no error messages.
>
> This is the first time I try to install bproc from src so I might be doing
> something wrong when building the kernel. Can anyone offer a hint?
>
> Sincerely, Thomas
>

--
-------------------------steven james, director of research, linux labs
... ........ ..... ....                    230 peachtree st nw ste 701
the original linux labs                          atlanta.ga.us 30303
-since 1995                                 http://www.linuxlabs.com
office 404.577.7747                                 fax 404.577.7743
----------------------------------------------------------------------- |
From: Wilton W. <ww...@ha...> - 2002-10-25 07:06:43
|
Hmm.. we are booting clusters full of 2466N's and we don't see the same problem that you do, the only problem in booting we have seen is that a non-SMP athlon kernel will NOT work on an SMP machine it just pukes with APIC errors.. is it possible that you have set console to serial ? - Wilton On Thu, 24 Oct 2002, Thomas Clausen wrote: > Hi, > > I'm trying to install bproc 3.2.1 and beoboot 1.3 on a cluster of dual CPU > Tyan 2466 running kernel 2.4.19. > > The master comes up fine. The nodes freeze while booting > /var/beowulf/boot.img > after > > . > . > . > mounting root fs > freeing unused kernel memory (260K) > > There are no error messages. > > This is the first time I try to install bproc from src so I might be doing > something wrong when building the kernel. Can anyone offer a hint? > > Sincerely, Thomas ----[ Wilton William Wong ]--------------------------------------------- 11060-166 Avenue Ph : 01-780-456-9771 High Performance UNIX Edmonton, Alberta FAX: 01-780-456-9772 and Linux Solutions T5X 1Y3, Canada URL: http://www.harddata.com -------------------------------------------------------[ Hard Data Ltd. ]---- |
From: Wilton W. <ww...@ha...> - 2002-10-25 06:45:12
|
On Thu, 24 Oct 2002, Luiz Otavio de Lima Rodrigues wrote:
> However, I did not find a line as
>
> pci 0x10ec 0x8129 8139too
>
> Where I obtain information on these configurations?

These IDs can be extracted from a running system using the "lspci" command; beoboot will also list the PCI IDs on the console if it fails to load a module for the network. Default driver information can be extracted from /usr/share/hwdata/pcitable if you have the hwdata rpm installed. Be aware that the driver listed in this file may not always be the best driver for your particular version of the hardware. For example, there are two different drivers one can use for a sym53c8xx SCSI card: the default "sym53c8xx.o" hasn't worked very well for us, causing all sorts of problems with DMA, but if we use the "sym53c8xx_2.o" driver everything works flawlessly.

Also, a current list of PCI IDs can be obtained from: http://pciids.sourceforge.net/

- Wilton

----[ Wilton William Wong ]---------------------------------------------
11060-166 Avenue    Ph : 01-780-456-9771   High Performance UNIX
Edmonton, Alberta   FAX: 01-780-456-9772   and Linux Solutions
T5X 1Y3, Canada     URL: http://www.harddata.com
-------------------------------------------------------[ Hard Data Ltd. ]---- |
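As a rough illustration of the lspci route described above, the following Python sketch turns numeric `lspci -n` output into `pci <vendor> <device> <driver>` lines in the format quoted in this thread. It is a sketch under assumptions: the sample output layout (with or without the word "Class") is what lspci versions commonly print, the driver for each ID still has to be chosen by hand or from pcitable, and the names `pci_lines` and `LSPCI_LINE` are made up for this example.

```python
import re

# Matches the class and vendor:device fields of one `lspci -n` row, e.g.
#   "00:12.0 0200: 8086:1209 (rev 09)"
#   "00:12.0 Class 0200: 8086:1209 (rev 09)"
LSPCI_LINE = re.compile(
    r"^\S+\s+(?:Class\s+)?([0-9a-fA-F]{4}):\s+"
    r"([0-9a-fA-F]{4}):([0-9a-fA-F]{4})"
)


def pci_lines(lspci_output: str, drivers: dict) -> list:
    """Build config.boot style 'pci' lines.

    `drivers` maps (vendor, device) lowercase hex strings to a module name;
    devices without an entry are skipped.
    """
    lines = []
    for row in lspci_output.splitlines():
        match = LSPCI_LINE.match(row.strip())
        if not match:
            continue
        _pci_class, vendor, device = (g.lower() for g in match.groups())
        driver = drivers.get((vendor, device))
        if driver:
            lines.append(f"pci 0x{vendor} 0x{device} {driver}")
    return lines


if __name__ == "__main__":
    sample = "00:12.0 0200: 8086:1209 (rev 09)\n"
    print("\n".join(pci_lines(sample, {("8086", "1209"): "eepro100"})))
```

The mapping is passed in explicitly on purpose: as the message notes, the default driver listed for a device is not always the one that works best.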
From: Thomas C. <tcl...@we...> - 2002-10-24 14:33:01
|
Hi, I'm trying to install bproc 3.2.1 and beoboot 1.3 on a cluster of dual CPU Tyan 2466 running kernel 2.4.19. The master comes up fine. The nodes freeze while booting /var/beowulf/boot.img after . . . mounting root fs freeing unused kernel memory (260K) There are no error messages. This is the first time I try to install bproc from src so I might be doing something wrong when building the kernel. Can anyone offer a hint? Sincerely, Thomas -- .^. Thomas Clausen, graduate student /V\ Physics Department, Wesleyan University, CT // \\ Tel 860-685-2018, fax 860-685-2031 /( )\ ^^-^^ Use Linux |
From: Luiz O. de L. R. <ll...@st...> - 2002-10-24 14:03:32
|
Hi Wilton,

Well, my config.boot is the default one, and in it I found the following lines:

bootmodule 3c59x acenic eepro100 hamachi natsemi dfme
bootmodule ne2k-pci pcnet 8139too sis900 sk98lin starfire de4x5
bootmodule tlan tulip via-rhine winbond-840 yellowfin

However, I did not find any line like

pci 0x10ec 0x8129 8139too

Where can I find information about these settings?

Thank you,
Luiz Otavio

(Translated from the Brazilian Portuguese original.)

-----Original Message-----
From: bpr...@li... [mailto:bpr...@li...] On behalf of Wilton Wong
Sent: Wednesday, October 23, 2002 21:10
To: Luiz Otavio de Lima Rodrigues
Cc: 'bpr...@li...'
Subject: Re: [BProc] Beoboot troubles

Hi Luiz,

Did you add the realtek and eepro100 modules, as well as their corresponding PCI IDs, to /etc/beowulf/config.boot and regenerate the phase 1/2 images?

<EXAMPLE lines for config.boot>
bootmodule 8139too
bootmodule eepro100
# RealTek 8139 Clones
pci 0x10ec 0x8129 8139too
pci 0x10ec 0x8138 8139too
pci 0x10ec 0x8139 8139too
pci 0x1113 0x1211 8139too
pci 0x1186 0x1300 8139too
pci 0x1186 0x1340 8139too
pci 0x13d1 0xab06 8139too
pci 0x4033 0x1360 8139too
#Intel eePRO100
pci 0x1014 0x005c eepro100
pci 0x10c3 0x1100 eepro100
pci 0x1259 0x2560 eepro100
pci 0x1266 0x0001 eepro100
pci 0x8086 0x1029 eepro100
pci 0x8086 0x1030 eepro100
pci 0x8086 0x1031 eepro100
pci 0x8086 0x1032 eepro100
pci 0x8086 0x1033 eepro100
pci 0x8086 0x1034 eepro100
pci 0x8086 0x1035 eepro100
pci 0x8086 0x1036 eepro100
pci 0x8086 0x1037 eepro100
pci 0x8086 0x1038 eepro100
pci 0x8086 0x1209 eepro100
pci 0x8086 0x1227 eepro100
pci 0x8086 0x1228 eepro100
pci 0x8086 0x1229 eepro100
pci 0x8086 0x2449 eepro100
pci 0x8086 0x5200 eepro100
pci 0x8086 0x5201 eepro100
</EXAMPLE>

- Wilton

On Wed, 23 Oct 2002, Luiz Otavio de Lima Rodrigues wrote:
>
> Hi people,
>
> I am using beoboot-lanl.1.3 and installed it without problems on my RedHat 8.0
> system with kernel 2.4.19 and the bproc 3.2.1 patch applied.

----[ Wilton William Wong ]---------------------------------------------
11060-166 Avenue    Ph : 01-780-456-9771   High Performance UNIX
Edmonton, Alberta   FAX: 01-780-456-9772   and Linux Solutions
T5X 1Y3, Canada     URL: http://www.harddata.com
-------------------------------------------------------[ Hard Data Ltd. ]---- |
From: Wilton W. <ww...@ha...> - 2002-10-24 00:10:11
|
Hi Luiz, Did you add the modules for the realtek and eepro100 modules as well as their corresponding pci id's into /etc/beowulf/config.boot and regenerate the phase 1/2 images ?

<EXAMPLE lines for config.boot>
bootmodule 8139too
bootmodule eepro100
# RealTek 8139 Clones
pci 0x10ec 0x8129 8139too
pci 0x10ec 0x8138 8139too
pci 0x10ec 0x8139 8139too
pci 0x1113 0x1211 8139too
pci 0x1186 0x1300 8139too
pci 0x1186 0x1340 8139too
pci 0x13d1 0xab06 8139too
pci 0x4033 0x1360 8139too
#Intel eePRO100
pci 0x1014 0x005c eepro100
pci 0x10c3 0x1100 eepro100
pci 0x1259 0x2560 eepro100
pci 0x1266 0x0001 eepro100
pci 0x8086 0x1029 eepro100
pci 0x8086 0x1030 eepro100
pci 0x8086 0x1031 eepro100
pci 0x8086 0x1032 eepro100
pci 0x8086 0x1033 eepro100
pci 0x8086 0x1034 eepro100
pci 0x8086 0x1035 eepro100
pci 0x8086 0x1036 eepro100
pci 0x8086 0x1037 eepro100
pci 0x8086 0x1038 eepro100
pci 0x8086 0x1209 eepro100
pci 0x8086 0x1227 eepro100
pci 0x8086 0x1228 eepro100
pci 0x8086 0x1229 eepro100
pci 0x8086 0x2449 eepro100
pci 0x8086 0x5200 eepro100
pci 0x8086 0x5201 eepro100
</EXAMPLE>

- Wilton

On Wed, 23 Oct 2002, Luiz Otavio de Lima Rodrigues wrote:
> > Hi people,
> > I am using beoboot-lanl.1.3 and I perfectly installed it in my RedHat 8.0
> with kernel 2.4.19, with path of bproc 3.2.,1 applied.

----[ Wilton William Wong ]---------------------------------------------
11060-166 Avenue Ph : 01-780-456-9771 High Performance UNIX
Edmonton, Alberta FAX: 01-780-456-9772 and Linux Solutions
T5X 1Y3, Canada URL: http://www.harddata.com
-------------------------------------------------------[ Hard Data Ltd. ]----
|
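A quick consistency check on a config.boot fragment like the one above can be sketched in Python. The sketch assumes only the syntax visible in this thread (`bootmodule` followed by one or more module names, `pci <vendor> <device> <driver>`, and `#` starting a comment); the function name `missing_bootmodules` is hypothetical, and whether beoboot itself requires every `pci` driver to be listed as a `bootmodule` is an assumption drawn from Wilton's advice, not from beoboot documentation.

```python
def missing_bootmodules(config_text):
    """Return drivers named on 'pci' lines that no 'bootmodule' line lists."""
    modules = set()
    pci_drivers = set()
    for raw in config_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # drop comments, then tokenize
        if not fields:
            continue
        if fields[0] == "bootmodule":
            modules.update(fields[1:])
        elif fields[0] == "pci" and len(fields) == 4:
            pci_drivers.add(fields[3])
    return sorted(pci_drivers - modules)


# Shortened fragment based on the example in this thread.
EXAMPLE = """\
bootmodule 8139too
# RealTek 8139 Clones
pci 0x10ec 0x8139 8139too
#Intel eePRO100
pci 0x8086 0x1229 eepro100
"""

if __name__ == "__main__":
    # eepro100 has a pci entry but no bootmodule line in this fragment.
    print(missing_bootmodules(EXAMPLE))
```

After editing config.boot, the phase 1/2 images still have to be regenerated for any change to take effect, as noted in the message.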
From: Luiz O. de L. R. <ll...@st...> - 2002-10-23 20:29:41
|
Hi people, I am using beoboot-lanl.1.3 and installed it without problems on my RedHat 8.0 system with kernel 2.4.19, with the bproc 3.2.1 patch applied. I can generate the phase 1 and phase 2 diskettes normally. My problem is that when the slave nodes boot, they do not recognize the NIC. Could anyone help me with this small problem? The kernel was compiled with the network card drivers built into the kernel itself, and my network cards are Realtek and Intel Pro 10/100. Thank you, Luiz Otavio UCB - Brasilia - DF |
From: <er...@he...> - 2002-10-23 19:51:04
|
On Wed, Oct 23, 2002 at 02:17:57PM -0400, Nicholas Henke wrote: > On Wed, 23 Oct 2002 13:32:42 -0400 > er...@he... wrote: > > > On Mon, Oct 21, 2002 at 10:06:44AM -0400, Nicholas Henke wrote: > > > Hrm -- I seems to be breaking things again. The following oops > > > occurred running my 'noop' script again -- the ps script is that > > > script, just remove the bpsh $node ps line. I have seen this oops > > > twice so far -- the Code: was the same in both. If the trace is > > > bogus, I would _really_ appreciate any pointers you could give me to > > > get better traces for you. > > > > Ok, here we go again. I saw an oops. It might be the oops you saw. > > I think the odds are decent that it's the same one. I fixed it (I > > think) and the slave nodes that were crashing with your script don't > > seem to be crashing anymore. So give this one a whirl and let me know > > how it goes. > > > > You can either try the patch (for kernel/slave.c) included in this > > email (in addition to all the other ones). Or just grab BProc 3.2.2 > > which I just threw up on SourceForge. Oh, and I fixed that silly > > VERSION:= build problem. > > Sounds good -- btw, do you have any updated docs for the bproc_* C api? > I am working on a python wrapper for the newer calls. I'm afraid not. Feel free to ask questions if the examples in things like bpsh and mpirun are insufficient. - Erik |
From: <er...@he...> - 2002-10-23 19:45:27
|
On Wed, Oct 23, 2002 at 03:33:55PM -0400, Nicholas Henke wrote: > On Wed, 23 Oct 2002 13:34:47 -0400 > er...@he... wrote: > > Is it safe/sane to run bproc 3.2.2 against a 2.4.18 kernel with the > procfs locking fix you put here? I have not had good luck with 2.4.19 > yet. Yeah, it should be. I just don't have the time to test and maintain against lots of different kernels. - Erik |
From: Nicholas H. <he...@se...> - 2002-10-23 19:33:37
|
On Wed, 23 Oct 2002 13:34:47 -0400 er...@he... wrote: Is it safe/sane to run bproc 3.2.2 against a 2.4.18 kernel with the procfs locking fix you put here? I have not had good luck with 2.4.19 yet. Nic -- Nicholas Henke Linux Cluster Systems Programmer |
From: Nicholas H. <he...@se...> - 2002-10-23 18:17:40
|
On Wed, 23 Oct 2002 13:32:42 -0400 er...@he... wrote: > On Mon, Oct 21, 2002 at 10:06:44AM -0400, Nicholas Henke wrote: > > Hrm -- I seems to be breaking things again. The following oops > > occurred running my 'noop' script again -- the ps script is that > > script, just remove the bpsh $node ps line. I have seen this oops > > twice so far -- the Code: was the same in both. If the trace is > > bogus, I would _really_ appreciate any pointers you could give me to > > get better traces for you. > > Ok, here we go again. I saw an oops. It might be the oops you saw. > I think the odds are decent that it's the same one. I fixed it (I > think) and the slave nodes that were crashing with your script don't > seem to be crashing anymore. So give this one a whirl and let me know > how it goes. > > You can either try the patch (for kernel/slave.c) included in this > email (in addition to all the other ones). Or just grab BProc 3.2.2 > which I just threw up on SourceForge. Oh, and I fixed that silly > VERSION:= build problem. Sounds good -- btw, do you have any updated docs for the bproc_* C api? I am working on a python wrapper for the newer calls. Nic |
From: <er...@he...> - 2002-10-23 17:53:49
|
BProc 3.2.2 is up on sourceforge in the usual location:
http://sourceforge.net/project/showfiles.php?group_id=24453&release_id=118027

Here are the release notes and change log:

3.2.2 ----------------------------------------------------------------------
This release has a number of important bug fixes. There are also some build time fixes for gcc 3.2. vrfork() robustness has also been improved somewhat. See the change log for details.

Changes from 3.2.1 to 3.2.2
* Fixed build problems and warnings with gcc 3.2.
* Changed rfork behavior to make BE_SAMENODE never happen. Effectively, this changes rfork() -> fork() when rforking to your current node. Note that the I/O handling isn't done in that case. (that's a bug :)
* Reworked rfork slightly to handle placing processes on the current node transparently.
* Fixed a procfs race with process ID mapping that could lead to kernel oopses on slave nodes.
* Fixed a slave side kernel oops. The symptom was bpslave oopsing in wake_up. The cause was a semaphore was used in a spot where I should have used a completion.
* Added a bit of makefile goop to allow building the kernel modules with a different compiler than the rest. (KCC=)
|
From: <er...@he...> - 2002-10-23 17:51:50
|
On Mon, Oct 21, 2002 at 10:06:44AM -0400, Nicholas Henke wrote:
> Hrm -- I seems to be breaking things again. The following oops occurred
> running my 'noop' script again -- the ps script is that script, just
> remove the bpsh $node ps line. I have seen this oops twice so far -- the
> Code: was the same in both. If the trace is bogus, I would _really_
> appreciate any pointers you could give me to get better traces for you.

Ok, here we go again. I saw an oops. It might be the oops you saw. I think the odds are decent that it's the same one. I fixed it (I think) and the slave nodes that were crashing with your script don't seem to be crashing anymore. So give this one a whirl and let me know how it goes.

You can either try the patch (for kernel/slave.c) included in this email (in addition to all the other ones). Or just grab BProc 3.2.2 which I just threw up on SourceForge. Oh, and I fixed that silly VERSION:= build problem.

- Erik

diff -u -r1.47 -r1.48
--- slave.c 5 Aug 2002 22:56:10 -0000 1.47
+++ slave.c 22 Oct 2002 22:01:02 -0000 1.48
@@ -17,7 +17,7 @@
  * along with this program; if not, write to the Free Software
  * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
  *
- * $Id: slave.c,v 1.47 2002/08/05 22:56:10 hendriks Exp $
+ * $Id: slave.c,v 1.48 2002/10/22 22:01:02 hendriks Exp $
  *-----------------------------------------------------------------------*/
 #define __NO_VERSION__

@@ -53,7 +53,7 @@
  ** exist) before allowing the slave daemon to continue.
  **/
 struct recv_proc_info {
-    struct semaphore sem;
+    struct completion cmpl;
     struct bproc_masq_master_t *master;
     struct bproc_krequest_t *req;
     int status;
@@ -75,7 +75,7 @@
     /* Let the slave daemon continue now that we've camped on
      * these pid's and it's safe to go on doing other things
      * now. */
-    up(&arg->sem); /* Can't use "arg" beyond this point. */
+    complete(&arg->cmpl); /* Can't use "arg" beyond this point. */
     if (r != 0) {
         bproc_put_req(req);
         silent_exit();
@@ -139,7 +139,7 @@
     struct recv_proc_info info;

     info = (struct recv_proc_info) {{}, master, req, 0};
-    init_MUTEX_LOCKED(&info.sem);
+    init_completion(&info.cmpl);

     /* We have to clear out PF_TRACESYS temporarily here since kernel
      * thread uses the system call entry/exit path. (on x86
@@ -157,7 +157,7 @@
     current->ptrace |= (flags & PT_TRACESYS);
     if (pid < 0)
         return pid;
-    down(&info.sem); /* Wait for the process to get setup */
+    wait_for_completion(&info.cmpl); /* Wait for the process to get setup */
     return info.status;
 }
|
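The handshake that the patch switches from a semaphore to a completion can be illustrated with a user-space analogy in Python. This is only a sketch of the pattern: `threading.Event` stands in for `struct completion`, the class `RecvProcInfo` and the functions `child` and `recv_process` are hypothetical names, and the code cannot reproduce the kernel oops itself. The usual reason given for this kind of change in the kernel is that the waiter's structure lives on its stack, so the signalling side must be finished with it by the time the waiter wakes up and returns; completions are designed for that use.

```python
import threading


class RecvProcInfo:
    """User-space stand-in for struct recv_proc_info (hypothetical)."""

    def __init__(self):
        self.cmpl = threading.Event()  # plays the role of struct completion
        self.status = None


def child(info):
    """Runs in the new thread: finish setup, then release the waiter."""
    info.status = 0   # publish the result before signalling
    info.cmpl.set()   # analogous to complete(&arg->cmpl)
    # As in the patch comment, "info" must not be used beyond this point.


def recv_process():
    """Start the child and wait until it has finished its setup."""
    info = RecvProcInfo()
    worker = threading.Thread(target=child, args=(info,))
    worker.start()
    info.cmpl.wait()  # analogous to wait_for_completion(&info.cmpl)
    status = info.status
    worker.join()
    return status


if __name__ == "__main__":
    print(recv_process())
```

In Python the object stays alive as long as either thread references it, so the lifetime problem does not arise here; the sketch shows only the ordering of "publish status, signal, stop touching the structure".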
From: Nicholas H. <he...@se...> - 2002-10-22 21:22:28
|
On Tue, 22 Oct 2002 15:33:16 -0400 Nicholas Henke <he...@se...> wrote:

I seem to be having a ton of these errors on our test cluster, but not on a different cluster. I have tried to find the discrepancies, but no luck. Do you have any idea what might be causing this?

$ strace mpirun -d -p --np 2 ./cpi
[snip]
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({5, 0}, {5, 0}) = 0
accept(3, {sin_family=AF_INET, sin_port=htons(33105), sin_addr=inet_addr("192.168.2.10")}}, [16]) = 4
read(4, "\3048\0\0", 4) = 4
read(4, "P\201\0\0", 4) = 4
accept(3, {sin_family=AF_INET, sin_port=htons(32976), sin_addr=inet_addr("192.168.2.11")}}, [16]) = 5
read(5, "\3058\0\0", 4) = 4
read(5, "\317\200\0\0", 4) = 4
write(1, " 0 0 14532 192.168.2.10 3"..., 37 0 0 14532 192.168.2.10 33104
) = 37
write(1, " 1 1 14533 192.168.2.11 3"..., 37 1 1 14533 192.168.2.11 32975
) = 37
write(4, "\2\0\0\0", 4) = 4
write(4, "\0\0\0\0", 4) = 4
write(4, "\3048\0\0", 4) = 4
write(4, "\300\250\2\n", 4) = 4
write(4, "\201P", 2) = 2
write(4, "\3058\0\0", 4) = 4
write(4, "\300\250\2\v", 4) = 4
write(4, "\200\317", 2) = 2
close(4) = 0
write(5, "\2\0\0\0", 4) = 4
write(5, "\1\0\0\0", 4) = 4
write(5, "\3048\0\0", 4) = 4
write(5, "\300\250\2\n", 4) = 4
write(5, "\201P", 2) = 2
write(5, "\3058\0\0", 4) = 4
write(5, "\300\250\2\v", 4) = 4
write(5, "\200\317", 2) = 2
close(5) = 0
wait4(-1, Process 0 on node1.internal.org
[WIFSIGNALED(s) && WTERMSIG(s) == SIGINT], 0, NULL) = 14533
--- SIGCHLD (Child exited) ---
write(2, "rank 1 pid=14533 exited with sig"..., 38rank 1 pid=14533 exited with signal 2
) = 38
wait4(-1, xm_14532: p4_error: net_recv read: probable EOF on socket: 1
Connection failed for reason: : Connection refused
Connection failed for reason: : Connection refused
[WIFSIGNALED(s) && WTERMSIG(s) == SIGPIPE], 0, NULL) = 14532
--- SIGCHLD (Child exited) ---
write(2, "rank 0 pid=14532 exited with sig"..., 39rank 0 pid=14532 exited with signal 13
) = 39
wait4(-1, 0xbffff9f8, 0, NULL) = -1 ECHILD (No child processes)
munmap(0x40017000, 4096) = 0
_exit(0)
|
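The byte strings in the trace can be decoded to see which address and port each rank is told to connect to. The Python sketch below is an interpretation, not documented mpirun behaviour: it assumes the two 4-byte values read from each rank are a little-endian PID and listening port (they match the table mpirun prints), and that the address and port written back are in network byte order. The helper names `decode_hello` and `decode_peer` are hypothetical.

```python
import socket
import struct


def decode_hello(pid_bytes, port_bytes):
    """Decode the two 4-byte values read from a rank (assumed little-endian)."""
    (pid,) = struct.unpack("<I", pid_bytes)
    (port,) = struct.unpack("<I", port_bytes)
    return pid, port


def decode_peer(addr_bytes, port_bytes):
    """Decode a 4-byte IPv4 address and 2-byte port in network byte order."""
    (port,) = struct.unpack("!H", port_bytes)
    return socket.inet_ntoa(addr_bytes), port


if __name__ == "__main__":
    # Byte strings copied from the read()/write() calls in the trace.
    print(decode_hello(b"\3048\0\0", b"P\201\0\0"))
    print(decode_hello(b"\3058\0\0", b"\317\200\0\0"))
    print(decode_peer(b"\300\250\2\n", b"\201P"))
    print(decode_peer(b"\300\250\2\v", b"\200\317"))
```

Under these assumptions the ranks are told to connect to 192.168.2.10 port 33104 and 192.168.2.11 port 32975, so the "Connection refused" errors would point at those address and port pairs as the place to start looking.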
From: Raj K. <RKa...@nv...> - 2002-10-22 20:23:30
|
-----Original Message-----
From: bpr...@li... [mailto:bpr...@li...]
Sent: Tuesday, October 22, 2002 12:09 PM
To: bpr...@li...
Subject: BProc-users digest, Vol 1 #151 - 1 msg

Today's Topics:
1. Re: another 3.2.0 oops (er...@he...)

--__--__--
Message: 1
Date: Mon, 21 Oct 2002 15:07:42 -0400
From: er...@he...
To: Nicholas Henke <he...@se...>
Cc: bpr...@li...
Subject: Re: [BProc] another 3.2.0 oops

On Mon, Oct 21, 2002 at 10:06:44AM -0400, Nicholas Henke wrote:
> Hrm -- I seems to be breaking things again. The following oops
> occurred running my 'noop' script again -- the ps script is that
> script, just remove the bpsh $node ps line. I have seen this oops
> twice so far -- the
> Code: was the same in both. If the trace is bogus, I would _really_
> appreciate any pointers you could give me to get better traces for you.
>
> Cheers!
> Nic
>
> Unable to handle kernel paging request at virtual address 0804a59b
> *pde = 1bde1067
> Oops: 0003
> CPU: 1
> EIP: [<c0116c6c>] Not tainted
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: 00010046
> Process bpslave (pid: 31698, stackpage=d8a45000)
> Stack: 00000001 00000282 00000003 dca85ef0 dc488cc0 00000000 dbd9800
> c010608c dca85ef0 dc40bfa0 00000000 e097af39 00000000 00000000
> 00000000 00000000 00000000
> 00000000 00000000 00000000 00000000 00000000 00000000 00000011
> Call Trace: [<c010608c>] [<e097af1d>] [<e096f329>] [<e097a698>]
> [<e096f313>]
> [<e097a698>]
> Code: c7 01 00 00 00 00 8b 41 3c 85 c0 75 2d a1 c0 d1 26 c0 8d 51
>
> >>EIP; c0116c6c <__wake_up+4c/c0> <=====
> Trace; c010608c <__up_wakeup+8/c>
> Trace; e097af1d <__module_using_checksums+5893/????>
> Trace; e096f329 <[bproc]bproc_iod_release+3d/89>
> Trace; e097a698 <__module_using_checksums+500e/????>
> Trace; e096f313 <[bproc]bproc_iod_release+27/89>
> Trace; e097a698 <__module_using_checksums+500e/????>

Hrm. It's hard to give generic pointers. I usually try to look at the whole mess and try to figure it out. This back trace looks reasonable except for the __module_using_checksums part. That's weird. A lot of the time I get the module binary and look for the code in it to see what function it's in. In this case you're in __wake_up so that's not that useful. As always reproducing it is the most reliable way to go. I just saw some more weirdness so I'm looking into it.

- Erik

--__--__--
End of BProc-users Digest
|