Migrated from http://devel.opensaf.org/ticket/3061
While testing http://devel.opensaf.org/ticket/3056 I found that SU restart does not follow the instantiation level as it is supposed to:
spec 3.8.2
"Within a service unit, the Availability Management Framework terminates the pre-instantiable components according to the configured instantiation level. All pre-instantiable components with the same instantiation level are terminated by the Availability Management Framework in parallel. Pre-instantiable components of a given level are only terminated by the Availability Management Framework when all pre-instantiable components with a higher instantiation level have been terminated.
As has been said previously, the instantiation level is only applicable during service unit instantiation and termination. As restarting a service unit means terminating the
service unit and instantiating it again, the instantiation level also applies when restarting a service unit."
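The ordering rule quoted above can be illustrated with a minimal sketch. It is not the amfnd implementation: struct comp, its fields, and next_termination_batch() are hypothetical names used only for illustration. The function returns the components that may be terminated in parallel next, that is, all not-yet-terminated components at the highest remaining instantiation level.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified component record (not the real amfnd type). */
struct comp {
    const char *name;
    unsigned inst_level;   /* configured instantiation level */
    bool terminated;       /* set once termination has completed */
};

/*
 * Store in batch[] the indices of all components that may be terminated
 * next: the not-yet-terminated components sharing the highest remaining
 * instantiation level. Returns the batch size, or 0 when every component
 * has already been terminated. batch[] must have room for n entries.
 */
size_t next_termination_batch(const struct comp *comps, size_t n, size_t *batch)
{
    assert(comps != NULL && batch != NULL);

    bool found = false;
    unsigned top = 0;

    /* Pass 1: find the highest level that still has a live component. */
    for (size_t i = 0; i < n; i++) {
        if (!comps[i].terminated && (!found || comps[i].inst_level > top)) {
            top = comps[i].inst_level;
            found = true;
        }
    }

    /* Pass 2: collect every live component at that level. */
    size_t count = 0;
    for (size_t i = 0; found && i < n; i++) {
        if (!comps[i].terminated && comps[i].inst_level == top)
            batch[count++] = i;
    }
    return count;
}
```

A caller would terminate every component in the returned batch, mark each one as terminated when it completes, and call the function again until it returns 0.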
It is obvious from the code in avnd_su_pres_inst_surestart_hdler():
/* If pi su, pick the first pi comp & trigger it's FSM with RestartEv. */
if (m_AVND_SU_IS_PREINSTANTIABLE(su)) {
TRACE("PI SU:'%s'",su->name.value);
for (curr_comp = m_AVND_COMP_FROM_SU_DLL_NODE_GET(m_NCS_DBLIST_FIND_FIRST(&su->comp_list));
The handler should pick the last component instead, since this list is sorted by instantiation level.
3.11.1.2:
"Restarting a service unit is achieved by the following actions:
• First, all components in the service unit are terminated in the order
dictated by their instantiation-levels.
• In a second step, all components in the service unit are instantiated
in the order dictated by their instantiation-levels."
That is not the case today, since each component is restarted (rather than all components being terminated and then all instantiated).
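The two-step sequence from 3.11.1.2 could be sketched as follows. This is an illustration under assumptions, not the actual fix: enum action, struct step, and su_restart_plan() are hypothetical names, the component list is assumed to be sorted by ascending instantiation level (as the ticket states for su->comp_list), and parallel handling of components at the same level is ignored for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical action kinds for one step of an SU restart. */
enum action { ACT_TERMINATE, ACT_INSTANTIATE };

struct step {
    enum action action;
    size_t comp_index;   /* index into the level-sorted component list */
};

/*
 * Fill plan[] (capacity 2 * n) with the restart sequence for n components
 * sorted by ascending instantiation level:
 *   1. terminate from the last component down to the first,
 *   2. instantiate from the first component up to the last.
 * Returns the number of steps written.
 */
size_t su_restart_plan(size_t n, struct step *plan)
{
    assert(plan != NULL || n == 0);

    size_t k = 0;

    /* Step 1: terminate, highest instantiation level first. */
    for (size_t i = n; i > 0; i--)
        plan[k++] = (struct step){ ACT_TERMINATE, i - 1 };

    /* Step 2: instantiate, lowest instantiation level first. */
    for (size_t i = 0; i < n; i++)
        plan[k++] = (struct step){ ACT_INSTANTIATE, i };

    return k;
}
```

The point of the sketch is that no component is instantiated before every component has been terminated, which differs from restarting each component in turn.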
A separate enhancement ticket, #1455, has been raised for the surestart admin operation.
This ticket covers surestart as a recovery escalation.
Spec sections related to surestart (recovery):
1) 3.2.2.1 Presence State (Component):
The following actions set the presence state of a component to terminating:
• The Availability Management Framework invokes the
  • SaAmfComponentTerminateCallbackT function (see Section 7.10.1),
  • or the saAmfCSIRemoveCallback() function (see Section 7.9.3),
  • or it executes the TERMINATE CLC-CLI function (see Section 4.7),
  as applicable according to Table 37 on page 439, to terminate the component gracefully.
• The Availability Management Framework abruptly terminates the component by using one of the following interfaces, as applicable according to Table 37 on page 439:
  • by executing the CLEANUP CLC-CLI command (see Section 4.8),
  • or by invoking the saAmfContainedComponentCleanupCallback() (see Section 7.10.5),
  • or by invoking the saAmfProxiedComponentCleanupCallback() (see Section 7.10.3).
A component will enter the uninstantiated state if it is successfully terminated or cleaned up; it enters the termination-failed state if the cleanup operation fails.
A component is restarted by the Availability Management Framework in the context of error recovery and repair actions (for details, see Section 3.11) or in the context of a restart administrative operation (for details, see Section 9.4.7). Restarting a component means first terminating it and then instantiating it again (see Section 3.11.1.2).
Two different actions shall be undertaken by the Availability Management Framework regarding the component service instances assigned to a component when the component restart is needed:
• Keep the component service instances assigned to the component while the
component is restarted. This action is typically performed when it is faster to
restart the component than to reassign the component service instances to
another component. In this case, the presence state of the component is set to restarting while the component is being terminated and until it is instantiated again (or a failure occurs). Internally, in this particular scenario, the Availability Management Framework withdraws and reassigns exactly the same HA state on behalf of all component service instances to the component as was assigned to the component for various component service instances before the restart procedure, without evaluating the various criteria that the Availability Management Framework would normally assess before making such an assignment.
• Reassign the component service instances currently assigned to the component to another component before terminating/instantiating the component. In this case, the presence state of the component is not set to restarting but transitions through the other presence state values (typically in the absence of failures: terminating, uninstantiated, instantiating, and then instantiated) as the component is terminated and instantiated again.
The choice between these two policies is based on the saAmfCompDisableRestart configuration attribute of each component (see the SaAmfComp object class in Section 8.13.2).
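As a rough illustration of the two policies quoted above, the sketch below returns the presence states a component passes through during a restart in the absence of failures. The enum values, PRES_PATH_MAX, and restart_presence_path() are hypothetical stand-ins chosen for this example; they are not the AMF API types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the presence states named in the spec text. */
enum pres_state {
    PS_UNINSTANTIATED,
    PS_INSTANTIATING,
    PS_INSTANTIATED,
    PS_TERMINATING,
    PS_RESTARTING
};

#define PRES_PATH_MAX 4   /* longest failure-free path produced below */

/*
 * Write the failure-free presence-state path of a component restart
 * into out[] and return its length.
 *   disable_restart == false: CSIs stay assigned, state is "restarting".
 *   disable_restart == true:  CSIs are reassigned first, and the
 *                             component walks through the normal states.
 */
size_t restart_presence_path(bool disable_restart,
                             enum pres_state out[PRES_PATH_MAX])
{
    assert(out != NULL);

    size_t k = 0;
    if (!disable_restart) {
        out[k++] = PS_RESTARTING;
    } else {
        out[k++] = PS_TERMINATING;
        out[k++] = PS_UNINSTANTIATED;
        out[k++] = PS_INSTANTIATING;
    }
    out[k++] = PS_INSTANTIATED;
    return k;
}
```

The boolean argument corresponds to the saAmfCompDisableRestart attribute of the component; how it is read from the configuration is outside the scope of this sketch.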
2) 3.11.1.3.1 Restart Recovery Action
⇒ Restart all components of the service unit: all components of the service unit that contains the erroneous component are abruptly terminated and then instantiated again (see Section 3.11.1.2). This action is performed as a consequence of an escalation of an SA_AMF_COMPONENT_RESTART recommended recovery action.
Scope of implementation:
1) saAmfCompDisableRestart is false for all components in the SU:
In this case surestart recovery involves two steps:
a) First, all components in the SU are abruptly terminated in the order dictated by their instantiation levels.
b) In a second step, all components in the SU are instantiated in the order dictated by their instantiation levels.
Also, if the components have assignments, the same assignments will be reassigned to the respective components after a successful restart of the SU.
2) saAmfCompDisableRestart is true for at least one component in the SU:
In this case, after cleaning up the failed component, AMF will start abruptly terminating the components, honoring the instantiation level in reverse order. During this process, AMF will take care of reassigning the CSIs assigned to the component that has this flag set to true. Since a component protects CSIs of a given redundancy model, the reassignment process depends on the characteristics of that redundancy model. In the 2N, N+M and NoRed cases this will lead to reassignment of the whole SU. As of now, in the NWay and NWay Active redundancy models as well, the presence of such a non-restartable component will lead to reassignment of the whole SU.
3) In both cases 1) and 2) above, each component will transition through the terminating, uninstantiated, instantiating, and then instantiated states.
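The distinction between cases 1) and 2) comes down to whether any component in the SU has saAmfCompDisableRestart set. A minimal sketch of that check follows; struct comp, its disable_restart field, and su_needs_reassignment() are hypothetical names for illustration and do not reflect the patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified component record (not the real amfnd type). */
struct comp {
    const char *name;
    bool disable_restart;   /* mirrors saAmfCompDisableRestart */
};

/*
 * Returns true when surestart recovery must reassign the SU (case 2),
 * i.e. when at least one component is non-restartable. Returns false
 * when every component is restartable (case 1), in which case the same
 * assignments are given back after the SU restart.
 */
bool su_needs_reassignment(const struct comp *comps, size_t n)
{
    assert(comps != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (comps[i].disable_restart)
            return true;
    }
    return false;
}
```

In either case the components themselves are still terminated in reverse instantiation-level order and instantiated again in level order; only the handling of the assignments differs.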
Comments are welcome.
Last edit: Praveen 2015-08-26
Attached are the patches for the 4.5 and 4.6 branches.
1) 4.5: 315_4.5.patch (parent: 6841:0198c81ad4ad).
2) 4.6: 315_4.6.patch (generated on parent: 6842:e9b051d8a81e).
For the default branch, 315.patch is rebased on parent 6844:ff30714b16bb (tip).
changeset: 6965:9c236928fc99
parent: 6962:38d62e4723f7
user: praveen.malviya@oracle.com
date: Thu Oct 01 15:16:19 2015 +0530
summary: amf: fix spec deviation of surestart escalation [#315]
changeset: 6964:77eec8b9b0ef
branch: opensaf-4.6.x
parent: 6959:f6ea7df23b04
user: praveen.malviya@oracle.com
date: Thu Oct 01 15:15:40 2015 +0530
summary: amf: fix spec deviation of surestart escalation [#315]
changeset: 6963:582475094724
branch: opensaf-4.5.x
parent: 6958:4a830a35990f
user: praveen.malviya@oracle.com
date: Thu Oct 01 15:14:52 2015 +0530
summary: amf: fix spec deviation of surestart escalation [#315]