From: <ad...@jb...> - 2005-12-05 18:28:11
"
Here's how I reproduced the bug.
1. Create the following classes
----- interface hatest.MyHASingletonServiceMBean -----
package hatest;
import org.jboss.ha.singleton.HASingletonMBean;
public interface MyHASingletonServiceMBean extends HASingletonMBean {
// nothing to add
}
----------
----- class hatest.MyHASingletonService -----
package hatest;
import org.jboss.ha.singleton.HASingletonSupport;
import org.jboss.logging.Logger;
public class MyHASingletonService extends HASingletonSupport implements MyHASingletonServiceMBean {
private static final Logger logger = Logger.getLogger(MyHASingletonService.class);
public void startSingleton() {
logger.info("I am the Master!");
}
public void stopSingleton() {
logger.info("I am no longer the Master.");
throw new RuntimeException("I don't want to die!");
}
}
----------
2. Package the classes in a SAR with the following jboss-service.xml
----- ha-test.sar/META-INF/jboss-service.xml -----
<?xml version="1.0" encoding="UTF-8"?>
<server>
  <mbean code="hatest.MyHASingletonService"
         name="hatest:service=MyHASingletonService">
    <depends>jboss:service=${jboss.partition.name:DefaultPartition}</depends>
  </mbean>
</server>
----------
3. Deploy to JBoss
15:47:02,978 INFO [MyHASingletonService] I am the Master!
4. Redeploy the SAR by touching META-INF/jboss-service.xml
15:47:18,168 INFO [MyHASingletonService] I am no longer the Master.
15:47:18,170 WARN [MyHASingletonService] Stopping failed hatest:service=MyHASingletonService
java.lang.RuntimeException: I don't want to die!
at hatest.MyHASingletonService.stopSingleton(MyHASingletonService.java:15)
...
15:47:18,348 WARN [ServiceController] Problem starting service hatest:service=MyHASingletonService
java.lang.NullPointerException
at org.jboss.ha.jmx.HAServiceMBeanSupport.getServiceHAName(HAServiceMBeanSupport.java:361)
at org.jboss.ha.jmx.HAServiceMBeanSupport$1.replicantsChanged(HAServiceMBeanSupport.java:195)
...
Comment by Mirko Nasato [03/Dec/05 11:08 AM]
Attached the zipped ha-test.sar used to reproduce the bug.
Comment by Mirko Nasato [03/Dec/05 11:12 AM]
Regular (non HA-singleton) MBeans do not have this problem, i.e. they are redeployed correctly even if they throw an exception when undeploying.
Comment by Mirko Nasato [03/Dec/05 11:51 AM]
In the real world situation we weren't throwing a RuntimeException on purpose, of course. A ClassCastException was generated because of another problem: a JNDI object being replaced by another one in a different ClassLoader by NonSerializableFactory after another EAR was deployed.
This bug effectively turned what was supposed to be a "high availability" service into a "zero availability" one.
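For readers who haven't hit this failure mode, here is a hypothetical illustration of that ClassCastException. The SharedService type and the JNDI name are invented for the sketch; only the mechanics are the point:
----- sketch (hypothetical): cross-ClassLoader ClassCastException -----
package hatest;
import javax.naming.InitialContext;
import javax.naming.NamingException;
// Invented stand-in for whatever object was bound via NonSerializableFactory.
interface SharedService {
}
public class SharedServiceLookup {
    public SharedService lookup() throws NamingException {
        Object bound = new InitialContext().lookup("ourapp/SharedService");
        // After the other EAR is redeployed, the bound instance was loaded
        // by a different ClassLoader, so this cast throws ClassCastException
        // even though the class name is identical.
        return (SharedService) bound;
    }
}
----------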
Comment by Scott Marlow [05/Dec/05 08:51 AM]
This is a 50/50 problem. If _stopOldMaster() fails, should the operation continue? There are probably some cases where the answer would be yes and some where it would be no.
If we change this for the 4.0.4 release to catch the exception, log it, and resume starting the new master (makeThisNodeMaster()), we would return from _stopOldMaster() not knowing whether the old singleton has stopped or not.
I'll go ahead and make the change as it should help in the case that you hit.
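To make the proposed change concrete, here is a minimal sketch. The names _stopOldMaster() and makeThisNodeMaster() come from this thread; the startNewMaster() call and the overall structure are assumptions for illustration, not the actual JBoss source:
----- sketch (hypothetical): catching the _stopOldMaster failure -----
public void makeThisNodeMaster() {
    try {
        // ask the partition to stop the old master singleton
        _stopOldMaster();
    } catch (Throwable t) {
        // Previously this exception aborted the election, so the new master
        // never started. Now we log it and carry on, accepting that we do
        // not know whether the old singleton really stopped.
        log.warn("_stopOldMaster failed, starting new master anyway", t);
    }
    startNewMaster();
}
----------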
Comment by Scott Marlow [05/Dec/05 10:09 AM]
As noted in my comments, this doesn't completely solve the problem: the root cause of the exception still needs to be solved, as it's unknown whether the singleton stopped or not.
Comment by Scott Marlow [05/Dec/05 10:10 AM]
The code change is in head and 4.0.4.
"
View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=3910725#3910725
Reply to the post : http://www.jboss.com/index.html?module=bb&op=posting&mode=reply&p=3910725
From: <ad...@jb...> - 2005-12-05 18:29:08
I think this fix is correct. But we should discuss the problem FIRST rather than just hacking a workaround for this particular use case.
View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=3910727#3910727
Reply to the post : http://www.jboss.com/index.html?module=bb&op=posting&mode=reply&p=3910727
From: mnasato <do-...@jb...> - 2005-12-05 21:51:11
I'll explain my particular case in even more detail, so you can see if it can apply to others as well.
The first time the HASingleton threw an exception when stopping was when one node was shut down (because of other problems with that node). The second node (we have only 2 nodes in the cluster at the moment) refused to become the master:
2005-11-18 12:14:19,721 ERROR [ourapp.HASingletonScheduledService] _stopOldMaster failed. New master singleton will not start.
In this case not starting the new master was clearly not the best choice, because the old master had certainly been stopped despite the exception, the whole application server being down. After the problem occurred we tried redeploying the EAR containing the service, to restore it without affecting the other EARs running in the same appserver, but each time we got that NPE in HAServiceMBeanSupport.getServiceHAName(). So eventually we had to bring down both JBoss nodes, which meant an outage in all the applications deployed in that cluster, just to get that single HASingleton service started again.
I agree that in other cases it may not be a good idea to start the new master if the old one failed to stop, because you could end up having the HASingleton service running on more than one node. But I think this is somewhat less likely to happen: as in our case, the service may be torn down anyway because it is being stopped as part of a server shutdown, or because it throws an exception while trying to close a resource that has already been closed, so it is effectively already stopped. And if it does happen, it is much easier to fix the situation: if your service is running on 2 nodes when it shouldn't be, you can just stop the HASingleton on one node using the JMX console as a temporary measure, or restart one JBoss node so that it comes up again in a clean state. The situation we ended up in required the whole cluster to be shut down and restarted, which is far worse.
Well, this is my biased point of view anyway ;-)
Thanks
Mirko
View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=3910787#3910787
Reply to the post : http://www.jboss.com/index.html?module=bb&op=posting&mode=reply&p=3910787
From: <ad...@jb...> - 2005-12-05 22:04:24
So there are really two questions for me:
1) Why does this fail in the first place? The lifecycle needs to be fixed to avoid the NPE. Indeed, what is wrong with the lifecycle that the MBean has no name?
2) When it does fail, what is the recovery? Clearly there is something going wrong if we are the master and we cannot stop ourselves. But we are probably stopping ourselves for a reason. With the exception of badly written subclasses (we can't do anything about bugs in users' create/start/stop/destroyService), we should be able to stop and restart regardless of errors.
We need to differentiate the problems:
a) We cannot stop ourselves
b) We cannot start the new master on a different node because we left the cluster or there is no cluster
c) We cannot start the new master on a different node because of some other error reported by the new master
and consider how to recover from them and report the underlying issue.
View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=3910789#3910789
Reply to the post : http://www.jboss.com/index.html?module=bb&op=posting&mode=reply&p=3910789
From: <bst...@jb...> - 2006-07-05 21:21:32
Thanks for the input on interoperability; we'll leave the issue for 5.0.1.CR1. I'm packing up my house right now, so will defer further comment for a couple weeks until I'm resettled :)
View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=3955674#3955674
Reply to the post : http://www.jboss.com/index.html?module=bb&op=posting&mode=reply&p=3955674
From: ScottMarlowNovell <do-...@jb...> - 2006-02-21 01:47:26
I agree that the root cause of the failure should be solved. Each occurrence of a failure is a separate issue from the HASingleton itself failing.
My take on recovery:
>a) We cannot stop ourselves
Send a message to an event listener indicating that we cannot stop ourselves (this might send an email or a beeper notification). Let user policy (code or configuration) determine whether we should {terminate server process, ignore error, try again}. Default action could be ignore. (A sketch of such a listener contract follows this list.)
> b) We cannot start the new master on a different node because we left the cluster or there is no cluster
I think that the current master will attempt to stop itself when HASingletonSupport.partitionTopologyChanged() is invoked. We could send a message to an event listener indicating that we left the cluster or there is no cluster. I may be reading the current code wrong, but it looks like the remaining cluster will elect a new master (need to verify).
Let user policy (code or configuration) determine whether we should {terminate server process, ignore error, try again}. Default action could be ignore.
>c) We cannot start the new master on a different node because of some other error reported by the new master
We could send a message to an event listener on the different node indicating that it failed to become master. Let user policy (code or configuration) determine whether we should {terminate server process, ignore error, try again}. Default action could be terminate server process so that a new master is chosen.
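A rough sketch of what such a listener/policy contract might look like; none of these types exist in JBoss, all the names are invented for illustration:
----- sketch (hypothetical): failure listener/policy contract -----
package hatest;
// Invented contract covering failure cases a), b) and c) above. The
// cluster code would apply whatever action the listener returns.
public interface HASingletonFailureListener {
    int TERMINATE_SERVER = 0; // shut down the server process
    int IGNORE = 1;           // log and continue (suggested default for a and b)
    int TRY_AGAIN = 2;        // retry the failed operation

    // case a): this node could not stop its own singleton
    int stopSelfFailed(Throwable cause);

    // case b): we left the cluster, or there is no cluster
    int leftCluster();

    // case c): the new master on another node failed to start
    // (suggested default: TERMINATE_SERVER so another master is chosen)
    int startNewMasterFailed(Throwable cause);
}
----------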
A nice thing would be if cluster management failures were defined as an aspect that could be handled consistently across the board. The problem I am thinking of is how we deal with failures across the board: do we manually handle the errors, or inject handlers that deal with varying qualities of service? Or perhaps I should be asking whether we should wait until we switch to using AOP to attempt across-the-board handling of failures.
View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=3925183#3925183
Reply to the post : http://www.jboss.com/index.html?module=bb&op=posting&mode=reply&p=3925183
From: ScottMarlowNovell <do-...@jb...> - 2006-07-05 18:00:54
anonymous wrote : Scott, can you take a look at this in conjunction with what Alex did on configuring the HASingleton election policy and try to determine if solving this will require some change that breaks interoperability between releases (as opposed to just introducing a new feature). If it will break interoperability, please reschedule to 5.0.0.Beta. Otherwise, let's do this for 5.0.1.CR1.
I don't see an interoperability issue here.
From a cluster management point of view, it would be nice to apply a generic solution for handling cluster errors such as a SingletonService stop failure. The handling code, as previously suggested, could deal with solving the problem or notify someone who can deal with it (email/pager/SNMP trap). Otherwise, we can continue to log the error as we do now and continue execution (starting the new singleton) with the hope that something might work. Of course users can catch the exception directly in their code and do something better (if there is a better thing to do).
Should we invite someone from the JBoss ON team to give input on this (cluster management) issue?
View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=3955617#3955617
Reply to the post : http://www.jboss.com/index.html?module=bb&op=posting&mode=reply&p=3955617
From: ScottMarlowNovell <do-...@jb...> - 2006-07-05 19:34:32
Alex and I discussed some of our options (using a policy-driven approach versus a listener).
It seems to me now that users can already detect the error in their implementation of stopSingleton by surrounding the logic with a try {} catch(java.lang.Throwable). So, I don't think we need to do anything else.
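For example, applied to the MyHASingletonService class from the bug report above (logger as declared there), the workaround could look something like this; the cleanup comment stands in for real application logic:
----- sketch (hypothetical): guarding stopSingleton() -----
public void stopSingleton() {
    try {
        logger.info("I am no longer the Master.");
        // real resource cleanup goes here; it may fail, e.g. when closing
        // a resource that has already been closed
    } catch (Throwable t) {
        // swallow the failure so the container can finish the
        // undeploy/redeploy cycle instead of aborting it
        logger.warn("stopSingleton cleanup failed, continuing", t);
    }
}
----------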
If we do decide to implement a cluster-wide error listener, we could include the stopSingleton failure in the list of events that can be observed.
Other thoughts?
View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=3955653#3955653
Reply to the post : http://www.jboss.com/index.html?module=bb&op=posting&mode=reply&p=3955653