From: Bryan B. <bry...@an...> - 2003-07-31 02:06:57
It's absolutely specific to this process. We run hundreds of other jobs through this cluster with migration on without any problems. The load balancing works great except for these huge memory hogs.

nomig is a workaround, except that if two large processes happen to launch on the same node, we want one of them to move to balance the memory load, since these are very long-running jobs.

I'm wondering if increasing the swap space from "only" 8GB to the recommended 4X (== 16GB) would make a difference. I'd suspect this, except that the process hangs even when it's the only one in the entire system; i.e. it shouldn't be trying to migrate.

Vance Morgan wrote:
> Sounds like a problem with the process locking the machine. What
> happens if you try openmosix with smaller, less demanding processes?
> Setiathome seems to be a good test app, as is the openmosix test suite
> from the openmosix contributors pages.
>
> I think the first goal is to see whether your problem is specific to the
> process or the machine.
>
> Vance.
>
> On Wed, 2003-07-30 at 19:43, Bryan Bayerdorffer wrote:
>
>>Oh also, the console gets millions of messages of the form "Received an
>>unauthorized information request from <IP address>" on the machine with the
>>unkillable process. The addresses are other nodes in the cluster. I'm not
>>using omdiscd.
>>
>>
>>Bryan Bayerdorffer wrote:
>>
>>>We have a CPU-bound program that allocates a lot of memory (2.5GB). If
>>>migration is enabled for this process---and even when only ONE such
>>>process is running on our 8-node (16 CPU) cluster and all the other CPUs
>>>are idle---the CPU on which the process is running becomes pinned at
>>>100% system time, and the process is unkillable.
>>>
>>>If the process is started with 'nomig' it behaves normally.
>>>
>>>2.4.20-openmosix-1, Dual Xeons, 4GB RAM per node.
>>>
>>>Additional symptoms:
>>>
>>>- The node can't be rebooted with 'shutdown -r'; needs physical reset
>>>- mtop hangs: can't SIGINT, but can SIGKILL.
>>>- ps hangs if it tries to get the hung process' argv.
>>>
>>>Haven't tried 2.4.21 yet. Is it likely to help?
>>>
>
>
>

--
.. ..-. ..- -.-. .- -. .-. . .- -.. - .... .. ... --. . - .- .-.. .. ..-. .
Bryan Bayerdorffer
br...@me... br...@sp...
(Wit's End Computation Center) (Analog Devices)

"Man's chief occupation is extermination of other animals and his own
species, which, however, multiplies with such insistent rapidity as to
infest the whole habitable earth and Canada."
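
On the swap-sizing question in the reply above, here is a minimal sketch, assuming a Linux node whose /proc/meminfo reports MemTotal and SwapTotal in kB, of checking how far a node is from an N-times-RAM swap rule of thumb. The 4X factor is the figure quoted in the message; the function names are hypothetical, and this only checks sizing. It says nothing about whether swap is actually related to the hang.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: leading integer value}.

    For the usual 'Name:   12345 kB' lines the value is in kB. Lines whose
    first token after the colon is not a plain integer are skipped.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_shortfall_kb(meminfo, factor=4):
    """Return how many kB of swap are missing to reach factor * RAM (0 if none)."""
    wanted = factor * meminfo["MemTotal"]
    return max(0, wanted - meminfo["SwapTotal"])


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the meminfo file on the local node."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Running swap_shortfall_kb(read_meminfo()) on each node would report the gap in kB. With the figures from the message (4GB RAM, 8GB swap per node) and a factor of 4, the shortfall works out to 8GB.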