I cannot subscribe to the developers' list, so I am posting this here.
My test OpenSSI cluster (1.0.0 rc1, RH9) used to crash under moderate or heavy load while load leveling was happening. I traced the error to the "is_loadlevelable" function in load_level.c. This line:
dentry = dget(PVP(p->p_vproc)->pvp_comm_de)
caused a kernel oops because the reference count of pvp_comm_de is 0. I searched the source tree for "pvp_comm_de" and found that the problem MIGHT be in cluster/ssi/vproc/dvp_vpops.c, which calls dput but does not set the pointer to NULL afterwards.
After applying the patch below, the cluster no longer crashes. I need someone to verify the logic of this patch and make sure that it has no side effects.
--- cluster/ssi/vproc/dvp_vpops.c 2004-01-15 18:02:03.000000000 -0600
+++ /usr/src/redhat/BUILD/kernel-2.4.20/linux-2.4.20/cluster/ssi/vproc/dvp_vpops.c 2004-01-27 12:36:51.000000000 -0600
@@ -732,7 +732,9 @@
pv->pvp_pproc->exit_signal = -2;
+ pv->pvp_comm_de = NULL;
+ pv->pvp_comm_mnt = NULL;
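
To make the idea behind the patch easier to review, here is a minimal userspace sketch of the "drop the reference, then clear the pointer" pattern. It is not the OpenSSI code: fake_dentry, fake_pvproc, fake_dget, fake_dput, release_comm_de and try_get_comm_de are hypothetical stand-ins I made up for illustration, and the assert only imitates the kind of check that would trip when a reference is taken on an object whose count is already 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the real kernel or OpenSSI structures. */
struct fake_dentry {
    int d_count;                      /* reference count */
};

struct fake_pvproc {
    struct fake_dentry *pvp_comm_de;  /* cached pointer, may be stale */
};

/* Take a reference. NULL is passed through untouched (an assumption of
 * this sketch); taking a reference on a zero-count object is treated as
 * a bug, which stands in for the oops described above. */
static struct fake_dentry *fake_dget(struct fake_dentry *de)
{
    if (de) {
        assert(de->d_count > 0);
        de->d_count++;
    }
    return de;
}

/* Drop a reference. A real implementation would free the object at 0. */
static void fake_dput(struct fake_dentry *de)
{
    if (de)
        de->d_count--;
}

/* Release the cached reference AND clear the pointer, so that later
 * readers see NULL instead of a pointer to a released object. */
static void release_comm_de(struct fake_pvproc *pv)
{
    fake_dput(pv->pvp_comm_de);
    pv->pvp_comm_de = NULL;
}

/* Reader side: returns 1 and a referenced object in *out, or 0 if the
 * pointer has already been released and cleared. */
static int try_get_comm_de(struct fake_pvproc *pv, struct fake_dentry **out)
{
    *out = fake_dget(pv->pvp_comm_de);
    return *out != NULL;
}
```

With the pointer cleared in release_comm_de, a later try_get_comm_de returns 0 instead of touching a released object. Whether the real is_loadlevelable handles a NULL pvp_comm_de correctly is exactly the part I would like someone to check.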