Fellow openmosix users,
Before I start, I'm really just a biologist so speak slowly and avoid
using long words.
We have a cluster of ten dual-processor P3 machines on which I
installed Red Hat with kernel 2.4.18 and then openMosix
2.4.22-openmosix3. All ten machines should be identical except that
I've replaced a couple of failed hard drives (so the sizes don't
match), and one of them (the head node under the original
configuration) has a second ethernet card which is unused. There are
also some stray files in /root on one of the nodes. The load balancing
works smoothly, jobs finish in good time, etc.
Also of note - we have a RAID elsewhere in the building which is nfs
mounted on each of the nodes independently. OMFS is evidently
installed on each node, but I don't know whether that means jobs no
longer need to migrate back to do I/O.
Please let me know if additional information would be useful and I will get it.
The problem is that occasionally, for no reason I can discern, a job will go
comatose - asleep and unkillable. Sometimes the job thinks that it has
migrated (the destination node may or may not know this) and sometimes
not. Sometimes one or more of the process-id named directories in
/proc will become unreadable (possibly a result of multiple attempts
to kill the job?), but sometimes not. Often, ps and top will stop
working, but sometimes not. I haven't looked systematically enough to
say whether these events are even related - they may be four totally
independent problems for all I know.
In general, the jobs in question are very large treatments of X-ray
diffraction data (used to solve protein crystal structures).
Individual threads can be upwards of 100MB, and they run for several
days over multiple refinement cycles. They involve a significant
amount of I/O, fairly continuous during the course of execution.
Anyway, if this is a known problem with a known fix, please point me
to it - otherwise, just some advice on what to try would help. Are the
versions derived from newer kernels more stable? I chose the oldest
version (in terms of kernel version) that had the drivers I needed for
our ethernet cards. Any help would be greatly appreciated.
I tried to search the mailing list archives for this question and did
not see it, but if suggestions are already there I apologize.
Hunt Lab/Structural Genomics Unit
Department of Biological Sciences
709 Fairchild Center
1212 Amsterdam Avenue
New York, New York 10027