From: ScottMarlowNovell <do-...@jb...> - 2006-02-21 01:47:26
I agree that the root cause of the failure should be fixed. Each occurrence of a failure is a separate issue from the HASingleton itself failing.
My take on recovery:
> a) We cannot stop ourselves
Send a message to an event listener indicating that we cannot stop ourselves (the listener might send an email or a beeper notification). Let user policy (code or configuration) determine whether we should {terminate server process, ignore error, try again}. The default action could be ignore.
> b) We cannot start the new master on a different node because we left the cluster or there is no cluster
I think that the current master will attempt to stop itself when HASingletonSupport.partitionTopologyChanged() is invoked. We could send a message to an event listener indicating that we left the cluster or that there is no cluster. I may be reading the current code wrong, but it looks like the remaining cluster will elect a new master (need to verify).
Let user policy (code or configuration) determine whether we should {terminate server process, ignore error, try again}. The default action could be ignore.
> c) We cannot start the new master on a different node because of some other error reported by the new master
We could send a message to an event listener on the different node indicating that it failed to become master. Let user policy (code or configuration) determine whether we should {terminate server process, ignore error, try again}. The default action could be terminate server process, so that a new master is chosen.
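To make cases a) to c) concrete, here is a rough sketch of what the listener plus policy could look like. All of the type and method names below (SingletonFailure, RecoveryAction, SingletonFailureListener, SingletonFailurePolicy) are made up for illustration; none of them exist in the current code base, and wiring this into HASingletonSupport is left open. The defaults are the ones suggested above.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical: the three failure cases discussed in this thread.
enum SingletonFailure {
    CANNOT_STOP_SELF,            // a)
    LEFT_CLUSTER_OR_NO_CLUSTER,  // b)
    NEW_MASTER_FAILED_TO_START   // c)
}

// Hypothetical: the actions a user policy may choose from.
enum RecoveryAction { TERMINATE_SERVER_PROCESS, IGNORE_ERROR, TRY_AGAIN }

// Hypothetical listener; an implementation might send email or page someone.
interface SingletonFailureListener {
    void onFailure(SingletonFailure failure, Throwable cause);
}

// Hypothetical policy holder: notifies listeners, then reports the configured action.
final class SingletonFailurePolicy {
    private final Map<SingletonFailure, RecoveryAction> actions =
            new EnumMap<>(SingletonFailure.class);
    private final List<SingletonFailureListener> listeners = new ArrayList<>();

    SingletonFailurePolicy() {
        // Defaults as proposed in the post.
        actions.put(SingletonFailure.CANNOT_STOP_SELF, RecoveryAction.IGNORE_ERROR);
        actions.put(SingletonFailure.LEFT_CLUSTER_OR_NO_CLUSTER, RecoveryAction.IGNORE_ERROR);
        actions.put(SingletonFailure.NEW_MASTER_FAILED_TO_START,
                RecoveryAction.TERMINATE_SERVER_PROCESS);
    }

    // Lets user code or configuration override the default for one failure type.
    void setAction(SingletonFailure failure, RecoveryAction action) {
        actions.put(failure, action);
    }

    void addListener(SingletonFailureListener listener) {
        listeners.add(listener);
    }

    // Notifies every listener, then returns the action the caller should carry out.
    RecoveryAction handle(SingletonFailure failure, Throwable cause) {
        for (SingletonFailureListener listener : listeners) {
            listener.onFailure(failure, cause);
        }
        return actions.get(failure);
    }
}
```

The policy only returns the chosen action; whoever detects the failure would still have to carry it out (terminate, retry, or do nothing). That keeps the notification and the decision in one place without deciding yet how each action gets executed.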
It would be nice if cluster management failures were defined as an aspect that could be handled consistently across the board. The question I am thinking about is how we deal with failures across the board: do we handle the errors manually, or do we inject handlers that deal with varying qualities of service? Or perhaps I should be asking whether we should wait until we switch to using AOP before attempting across-the-board handling of failures.
View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=3925183#3925183