The issue is seen on changeset 4733 plus the CLM patches corresponding to the changesets of #220. Continuous failovers are happening while some API invocations of the IMM application are ongoing. The IMMD has asserted on the new active controller, which is reported in ticket #721.
When both controllers got rebooted, the payloads did not get rebooted; instead, the OpenSAF services remained up and running on them. CLM shows that both payloads are not part of the cluster. When the payloads were restarted manually, they rejoined the cluster.
PL-3 syslog:
Jan 15 18:23:09 SLES-64BIT-SLOT3 osafimmnd[3550]: NO implementer for class 'testMA_verifyObjApplNoResponseModCallback_101' is released => class extent is UNSAFE
Jan 15 18:23:59 SLES-64BIT-SLOT3 logger: Invoking failover from invoke_failover.sh
Jan 15 18:24:01 SLES-64BIT-SLOT3 osafimmnd[3550]: WA DISCARD DUPLICATE FEVS message:92993
Jan 15 18:24:01 SLES-64BIT-SLOT3 osafimmnd[3550]: WA Error code 2 returned for message type 57 - ignoring
Jan 15 18:24:01 SLES-64BIT-SLOT3 osafimmnd[3550]: WA DISCARD DUPLICATE FEVS message:92994
Jan 15 18:24:01 SLES-64BIT-SLOT3 osafimmnd[3550]: WA Error code 2 returned for message type 57 - ignoring
Jan 15 18:24:01 SLES-64BIT-SLOT3 osafimmnd[3550]: WA Director Service in NOACTIVE state - fevs replies pending:1 fevs highest processed:92994
Jan 15 18:24:01 SLES-64BIT-SLOT3 osafimmnd[3550]: NO No IMMD service => cluster restart
Jan 15 18:24:01 SLES-64BIT-SLOT3 osafamfnd[3572]: NO 'safComp=IMMND,safSu=PL-3,safSg=NoRed,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'componentRestart'
Jan 15 18:24:01 SLES-64BIT-SLOT3 osafimmnd[6827]: Started
Jan 15 18:24:01 SLES-64BIT-SLOT3 osafimmnd[6827]: NO Persistent Back-End capability configured, Pbe file:imm.db (suffix may get added)
Jan 15 18:24:07 SLES-64BIT-SLOT3 kernel: [ 6343.176901] TIPC: Resetting link <1.1.3:eth0-1.1.2:eth0>, peer not responding
Jan 15 18:24:07 SLES-64BIT-SLOT3 kernel: [ 6343.176911] TIPC: Lost link <1.1.3:eth0-1.1.2:eth0> on network plane A
Jan 15 18:24:07 SLES-64BIT-SLOT3 kernel: [ 6343.176918] TIPC: Lost contact with <1.1.2>
Jan 15 18:24:07 SLES-64BIT-SLOT3 kernel: [ 6343.256091] TIPC: Resetting link <1.1.3:eth0-1.1.1:eth0>, peer not responding
Jan 15 18:24:07 SLES-64BIT-SLOT3 kernel: [ 6343.256100] TIPC: Lost link <1.1.3:eth0-1.1.1:eth0> on network plane A
Jan 15 18:24:07 SLES-64BIT-SLOT3 kernel: [ 6343.256106] TIPC: Lost contact with <1.1.1>
Jan 15 18:24:25 SLES-64BIT-SLOT3 kernel: [ 6361.425537] TIPC: Established link <1.1.3:eth0-1.1.2:eth0> on network plane A
Jan 15 18:24:27 SLES-64BIT-SLOT3 osafimmnd[6827]: NO SERVER STATE: IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
Jan 15 18:24:27 SLES-64BIT-SLOT3 osafimmnd[6827]: NO SERVER STATE: IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
Jan 15 18:24:27 SLES-64BIT-SLOT3 osafimmnd[6827]: NO SERVER STATE: IMM_SERVER_LOADING_PENDING --> IMM_SERVER_LOADING_CLIENT
Jan 15 18:24:29 SLES-64BIT-SLOT3 osafimmnd[6827]: NO ERR_BAD_HANDLE: Admin owner 1 does not exist
Jan 15 18:24:36 SLES-64BIT-SLOT3 kernel: [ 6372.473240] TIPC: Established link <1.1.3:eth0-1.1.1:eth0> on network plane A
Jan 15 18:24:39 SLES-64BIT-SLOT3 osafimmnd[6827]: NO ERR_BAD_HANDLE: Admin owner 2 does not exist
Jan 15 18:24:39 SLES-64BIT-SLOT3 osafimmnd[6827]: NO NODE STATE-> IMM_NODE_LOADING
Jan 15 18:24:45 SLES-64BIT-SLOT3 osafimmnd[6827]: WA Number of objects in IMM is:5000
Jan 15 18:24:46 SLES-64BIT-SLOT3 osafimmnd[6827]: WA Number of objects in IMM is:6000
Jan 15 18:24:47 SLES-64BIT-SLOT3 osafimmnd[6827]: WA Number of objects in IMM is:7000
Jan 15 18:24:48 SLES-64BIT-SLOT3 osafimmnd[6827]: WA Number of objects in IMM is:8000
Jan 15 18:24:49 SLES-64BIT-SLOT3 osafimmnd[6827]: WA Number of objects in IMM is:9000
After both controllers came up, the status is as follows:
SLES-64BIT-SLOT1:~ # immlist safNode=PL-3,safCluster=myClmCluster
Name Type Value(s)
========================================================================
safNode SA_STRING_T safNode=PL-3
saClmNodeLockCallbackTimeout SA_TIME_T 50000000000 (0xba43b7400, Thu Jan 1 05:30:50 1970)
saClmNodeIsMember SA_UINT32_T <empty>
saClmNodeInitialViewNumber SA_UINT64_T <empty>
saClmNodeID SA_UINT32_T <empty>
saClmNodeEE SA_NAME_T <empty>
saClmNodeDisableReboot SA_UINT32_T 0 (0x0)
saClmNodeCurrAddressFamily SA_UINT32_T <empty>
saClmNodeCurrAddress SA_STRING_T <empty>
saClmNodeBootTimeStamp SA_TIME_T <empty>
saClmNodeAdminState SA_UINT32_T 1 (0x1)
saClmNodeAddressFamily SA_UINT32_T <empty>
saClmNodeAddress SA_STRING_T <empty>
SaImmAttrImplementerName SA_STRING_T safClmService
SaImmAttrClassName SA_STRING_T SaClmNode
SaImmAttrAdminOwnerName SA_STRING_T IMMLOADER
SLES-64BIT-SLOT1:~ # immlist safAmfNode=PL-3,safAmfCluster=myAmfCluster
Name Type Value(s)
========================================================================
safAmfNode SA_STRING_T safAmfNode=PL-3
saAmfNodeSuFailoverMax SA_UINT32_T 2 (0x2)
saAmfNodeSuFailOverProb SA_TIME_T 1200000000000 (0x1176592e000, Thu Jan 1 05:50:00 1970)
saAmfNodeOperState SA_UINT32_T 2 (0x2)
saAmfNodeFailfastOnTerminationFailure SA_UINT32_T 0 (0x0)
saAmfNodeFailfastOnInstantiationFailure SA_UINT32_T 0 (0x0)
saAmfNodeClmNode SA_NAME_T safNode=PL-3,safCluster=myClmCluster (36)
saAmfNodeCapacity SA_STRING_T <empty>
saAmfNodeAutoRepair SA_UINT32_T 1 (0x1)
saAmfNodeAdminState SA_UINT32_T 1 (0x1)
SaImmAttrImplementerName SA_STRING_T safAmfService
SaImmAttrClassName SA_STRING_T SaAmfNode
SaImmAttrAdminOwnerName SA_STRING_T IMMLOADER
SLES-64BIT-SLOT3:/opt/goahead/tetware/opensaffire # /etc/init.d/opensafd status
safSISU=safSu=SC-1\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed2,safApp=OpenSAF
saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=SC-2\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed1,safApp=OpenSAF
saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
saAmfSISUHAState=STANDBY(2)
safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
saAmfSISUHAState=ACTIVE(1)
SLES-64BIT-SLOT3:/opt/goahead/tetware/opensaffire # cat /etc/opensaf/node_name
PL-3
SLES-64BIT-SLOT3:/opt/goahead/tetware/opensaffire # ps -ef | grep saf
root 3538 1 0 17:16 ? 00:00:00 /bin/sh /usr/lib64/opensaf/clc-cli/osaf-transport-monitor
root 3563 1 0 17:16 ? 00:00:00 /usr/lib64/opensaf/osafclmna --tracemask=0xffffffff
root 3572 1 0 17:16 ? 00:00:00 /usr/lib64/opensaf/osafamfnd --tracemask=0xffffffff
root 3582 1 0 17:16 ? 00:00:00 /usr/lib64/opensaf/osafsmfnd
root 3591 1 0 17:16 ? 00:00:00 /usr/lib64/opensaf/osafmsgnd
root 3608 1 0 17:16 ? 00:00:00 /usr/lib64/opensaf/osaflcknd
root 3617 1 0 17:16 ? 00:00:00 /usr/lib64/opensaf/osafckptnd
root 3626 1 0 17:16 ? 00:00:00 /usr/lib64/opensaf/osafamfwd
root 6827 1 1 18:24 ? 00:00:13 /usr/lib64/opensaf/osafimmnd --tracemask=0xffffffff
root 7490 3073 0 18:42 pts/0 00:00:00 grep saf
The same is observed on PL-4. The AMF traces are attached.
The setup uses 2PBE with nearly 24k objects.
The issue is not reproducible for me during multiple failovers.
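For completeness, a quick way to confirm the object count mentioned above on a running node (a sketch only, assuming the stock immfind tool is available in the path):
# immfind with no arguments prints every IMM object DN, one per line,
# so the line count approximates the total number of objects (~24k here).
immfind | wc -l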
Please share the mds.log for the payloads. Enable MDS_LOG_LEVEL=5 for the AMFND process on the payloads, and share the mds.log as well if the issue is hit the next time.
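For reference, a minimal sketch of how that could be done, assuming the standard packaging where per-service environment files live under /etc/opensaf/ and MDS writes its log under /var/log/opensaf/ (file names and paths may differ on this installation):
# On each payload (PL-3, PL-4): raise MDS logging for AMFND.
# Assumption: amfnd.conf is sourced by the AMFND CLC script on this install.
echo 'export MDS_LOG_LEVEL=5' >> /etc/opensaf/amfnd.conf
# Restart OpenSAF on the payload so the new environment takes effect.
/etc/init.d/opensafd restart
# After the issue is hit again, collect the MDS log (assumed default location).
cp /var/log/opensaf/mds.log /tmp/mds.log.$(cat /etc/opensaf/node_name)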
As per the payload traces, AMFND did not receive any VDest down event, which is why it did not reboot the payload.
So this ticket is not related to AMF; it should be looked at from a system perspective, with MDS in mind, for not delivering the VDest down.
Further analysis should be done from the MDS perspective.
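As a starting point for that analysis, a rough way to scan the payload's mds.log for down indications (a sketch only; the exact message wording and log location vary by OpenSAF version, so the pattern and path below are assumptions):
# Scan the payload MDS log for any service/VDest down entries and correlate
# them with the TIPC "Lost link"/"Lost contact" kernel messages shown above.
grep -in 'down' /var/log/opensaf/mds.log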
I think this ticket (#722) can be seen as a duplicate of #721, or vice versa. Same root cause: the changed FM logic in 4.4.
Neither 721 nor 722 has been reproduced.
But yes, 722 happened immediately after 721 occurred on the same setup.
However, the problems in 721 and 722 are different.
In 722, even though the controllers went for a reboot, the OpenSAF services are still alive on the payloads; they should have received the service down events, after which a reboot of the payload would have been triggered.
721, on the other hand, is most likely a case of a faster failover during which the down events either were not delivered to IMM or were queued up in the IMM process, but the assert (coordinator election) happened before the down event was processed. I am still working on 721.
Well, #721 is not just a case of a "faster" failover; it is a case of an incorrect, or "unclean", failover.
This is because the failover is done before the MDS down has been received.
We should revert the FM behavior back to how it was before, unless someone can come up with some other way to get FM to postpone the failover trigger until all the MDS messages from the old active have been plugged by the MDS receiver library.
/AndersBj
Needs to be re-evaluated on the latest staging; targeted for 4.7.