From: Shobana R. <sh...@cs...> - 2005-10-21 07:23:49
I am working on a Nimbus 4 system, based on Clustermatic, on a 15-node
cluster of Opterons. I am having intermittent problems running MPI
applications on it.
When I try to run the ASCI benchmark sPPM, 4 times out of 5 I get
errors that look like this:
p3_16410: p4_error: interrupt SIGSEGV: 11
p3_16410: (23.234375) net_send: could not write to fd=4, errno = 32
p4_16412: (33.445312) net_send: could not write to fd=4, errno = 32
p11_16426: (33.089844) net_send: could not write to fd=4, errno = 32
p2_16408: (33.558594) net_send: could not write to fd=4, errno = 32
p9_16422: (33.207031) net_send: could not write to fd=4, errno = 32
p4_error: latest msg from perror: Broken pipe
p4_16412: p4_error: net_send write: -1
p4_error: latest msg from perror: Broken pipe
p11_16426: p4_error: net_send write: -1
p4_error: latest msg from perror: Broken pipe
p2_16408: p4_error: net_send write: -1
p4_error: latest msg from perror: Broken pipe
p9_16422: p4_error: net_send write: -1
p12_16428: (34.140625) net_send: could not write to fd=4, errno = 32
p4_error: latest msg from perror: Broken pipe
bm_list_16405: (71.308594) net_send: could not write to fd=4, errno = 9
p4_error: latest msg from perror: Bad file descriptor
bm_list_16405: p4_error: net_send write: -1
... and so on. This happens often but not always, and the node on which
the segfault occurs is different each time. The exact same code works fine
on other clusters.
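
For anyone triaging output like the above, here is a minimal sketch in Python
that separates the process that received the signal from the processes that
only report net_send failures. The "Broken pipe" (errno 32) and "Bad file
descriptor" (errno 9) messages are probably just the other ranks noticing
that a peer has died, so the SIGSEGV line is likely the one worth chasing.
The function name triage and the two regular expressions are made up for
illustration, and they assume the line formats shown in the output above;
this is not part of MPICH or Clustermatic.

```python
import re

# Assumed line formats, copied from the error output in this post:
#   p3_16410: p4_error: interrupt SIGSEGV: 11
#   p4_16412: (33.445312) net_send: could not write to fd=4, errno = 32
SIGNAL_RE = re.compile(r"^(\w+)_(\d+):\s+p4_error: interrupt (SIG\w+)")
SEND_RE = re.compile(r"^(\w+)_(\d+):\s+\(([\d.]+)\) net_send: .*errno = (\d+)")


def triage(log_text):
    """Hypothetical helper: split a p4 log into signal hits and send failures."""
    crashed = []        # (process label, pid, signal name)
    send_failures = []  # (process label, timestamp, errno)
    for line in log_text.splitlines():
        line = line.strip()
        m = SIGNAL_RE.match(line)
        if m:
            crashed.append((m.group(1), int(m.group(2)), m.group(3)))
            continue
        m = SEND_RE.match(line)
        if m:
            send_failures.append((m.group(1), float(m.group(3)), int(m.group(4))))
    return crashed, send_failures


# A few lines taken verbatim from the output above.
SAMPLE = """\
p3_16410: p4_error: interrupt SIGSEGV: 11
p3_16410: (23.234375) net_send: could not write to fd=4, errno = 32
p4_16412: (33.445312) net_send: could not write to fd=4, errno = 32
p4_error: latest msg from perror: Broken pipe
bm_list_16405: (71.308594) net_send: could not write to fd=4, errno = 9
"""

crashed, send_failures = triage(SAMPLE)
print("received a signal:", crashed)
print("send failures:", send_failures)
```

Running this over the logs from several failed runs would show whether the
crashing rank really is random or whether it favours particular nodes.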
My environment settings are:
BEOWULF_JOB_MAP=0:1:2:3:4:5:6:7:8:9:10:11:12:13:14
NP=15
Unfortunately, I don't have the sources for the installed MPICH. The
installed version is in a directory named mpich-gnu-lila-p4-1.2.5..10.
I tried compiling MPICH 1.2.7 for this system, but I couldn't find the
patches needed to get it working with Nimbus.
If anybody could help me with this problem, or give me pointers as to what
to explore/debug, I would greatly appreciate it.
Regards,
Shobana