From: Shobana R. <sh...@cs...> - 2005-10-21 07:23:49
I am working on the Nimbus 4 system, based on Clustermatic, on a 15-node cluster of Opterons. I am having intermittent problems running MPI applications on it. When I run the ASCI benchmark sPPM, 4 times out of 5 I get errors that look like this:

    p3_16410: p4_error: interrupt SIGSEGV: 11
    p3_16410: (23.234375) net_send: could not write to fd=4, errno = 32
    p4_16412: (33.445312) net_send: could not write to fd=4, errno = 32
    p11_16426: (33.089844) net_send: could not write to fd=4, errno = 32
    p2_16408: (33.558594) net_send: could not write to fd=4, errno = 32
    p9_16422: (33.207031) net_send: could not write to fd=4, errno = 32
    p4_error: latest msg from perror: Broken pipe
    p4_16412: p4_error: net_send write: -1
    p4_error: latest msg from perror: Broken pipe
    p11_16426: p4_error: net_send write: -1
    p4_error: latest msg from perror: Broken pipe
    p2_16408: p4_error: net_send write: -1
    p4_error: latest msg from perror: Broken pipe
    p9_16422: p4_error: net_send write: -1
    p12_16428: (34.140625) net_send: could not write to fd=4, errno = 32
    p4_error: latest msg from perror: Broken pipe
    bm_list_16405: (71.308594) net_send: could not write to fd=4, errno = 9
    p4_error: latest msg from perror: Bad file descriptor
    bm_list_16405: p4_error: net_send write: -1
    ....

and so on. This happens often but not always, and the node on which the segfault occurs is different each time. The exact same code works fine on other clusters.

My environment settings are:

    BEOWULF_JOB_MAP=0:1:2:3:4:5:6:7:8:9:10:11:12:13:14
    NP=15

Unfortunately, I don't have the sources for the installed MPICH. The installed version is in a directory named mpich-gnu-lila-p4-1.2.5..10. I tried compiling MPICH 1.2.7 for this system, but I couldn't find the patches needed to get it working with Nimbus.

If anybody could help me with this problem, or give me pointers on what to explore or debug, I would greatly appreciate it.

Regards,
Shobana