
#2372 amf: CLM lock of two more nodes returns REPAIR_PENDING for first node.

5.0.2
fixed
Praveen
None
defect
clm
d
default
major
2017-03-28
2017-03-14
Praveen
No

Steps to reproduce:
1) Bring up a 4-node cluster.
2) Deploy the AMF demo on PL-3 and PL-4.
3) LOCK amfd nodes PL-3 and PL-4.
4) Make arrangements so that termination of amf_demo on PL-3 takes more time than on PL-4.
5) From one terminal, issue a CLM lock of PL-3 first and, immediately afterwards, a CLM lock of PL-4.

CLM and AMF traces are attached.
Analysis:
When AMFD gets the CLM track callback for PL-3, it starts terminating amf_demo on PL-3. While termination of amf_demo is still in progress, AMF gets another track callback with rootCauseEntity set to PL-4. However, this callback also contains information about PL-3. AMFD starts terminating amf_demo on PL-4, but at the same time it responds for PL-3 using the invocation ID of the PL-4 callback. CLM assumes that the PL-4 CHANGE_START step has completed and sends the completion callback for PL-4. In this callback, AMF clears the internal flags that monitor the graceful removal of nodes. Since AMF never responded to the PL-3 callback, the callback timer expires in CLMD and it sends the COMPLETED callback to AMF. AMF treats this as a node failover and tries to fail over PL-3.
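
The effect described in the analysis can be illustrated with a toy model. This is only a sketch, not OpenSAF code: `ClmModel`, `simulate`, and the invocation IDs 1 and 2 are hypothetical, and the model only tracks which START callbacks are still waiting for a response.

```python
# Toy model of the invocation ID mix-up; all names are hypothetical.

class ClmModel:
    """Remembers which START callbacks still await a response."""

    def __init__(self):
        self.pending = {}    # invocation ID -> root cause node
        self.completed = []  # nodes whose START step CLM considers done

    def send_start(self, invocation, node):
        self.pending[invocation] = node

    def response(self, invocation):
        # CLM can only match a response by its invocation ID.
        node = self.pending.pop(invocation, None)
        if node is not None:
            self.completed.append(node)
        return node


def simulate(per_node_invocation):
    """AMF answers on behalf of PL-3 after both START callbacks arrived."""
    clm = ClmModel()
    clm.send_start(1, "PL-3")
    clm.send_start(2, "PL-4")

    invocation_of = {"PL-3": 1, "PL-4": 2}  # what AMF should remember
    latest_invocation = 2                   # ID of the most recent callback

    if per_node_invocation:
        chosen = invocation_of["PL-3"]      # intended behaviour
    else:
        chosen = latest_invocation          # behaviour seen in the traces
    clm.response(chosen)
    return clm.completed, sorted(clm.pending.values())


print(simulate(per_node_invocation=False))
print(simulate(per_node_invocation=True))
```

With `per_node_invocation=False`, the model marks PL-4 as completed and leaves PL-3 pending, which corresponds to the timer expiry for PL-3 described above. With `per_node_invocation=True`, PL-3 is completed and PL-4 stays pending until its own response is sent.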

Note: In all these stages, CLM sends the track callback with information about all the nodes. AMF registers with the following track flags:
SA_TRACK_CURRENT|SA_TRACK_CHANGES_ONLY|SA_TRACK_VALIDATE_STEP|SA_TRACK_START_STEP. I am still evaluating whether the issue is in CLM or AMF. Since AMF registers for SA_TRACK_CHANGES_ONLY, should CLM give information about all the nodes in all subsequent callbacks?
Also, AMF should respond to a callback only when it has completed termination of the comps.
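
The last point, responding only once termination has finished, could be sketched as follows. This is a minimal illustration under assumptions, not the actual AMF fix: `PendingNodeResponse` and the component names are hypothetical. In the real API the returned invocation ID would be passed to `saClmResponse_4`.

```python
# Sketch: defer the CLM response for a node until all its comps are gone.
# Class, method, and component names are hypothetical.

class PendingNodeResponse:
    def __init__(self):
        self._pending = {}  # node -> (invocation ID, comps still terminating)

    def on_start_callback(self, node, invocation, components):
        """Return the invocation ID to answer now, or None to defer."""
        comps = set(components)
        if not comps:
            return invocation  # nothing to terminate, answer immediately
        self._pending[node] = (invocation, comps)
        return None

    def on_component_terminated(self, node, component):
        """Return the node's own invocation ID once its last comp is gone."""
        entry = self._pending.get(node)
        if entry is None:
            return None
        invocation, comps = entry
        comps.discard(component)
        if comps:
            return None  # still waiting for other comps on this node
        del self._pending[node]
        return invocation


tracker = PendingNodeResponse()
tracker.on_start_callback("PL-3", 1, ["demo-a", "demo-b"])
tracker.on_start_callback("PL-4", 2, ["demo-c"])
first = tracker.on_component_terminated("PL-4", "demo-c")   # PL-4 done first
middle = tracker.on_component_terminated("PL-3", "demo-a")  # PL-3 not done yet
last = tracker.on_component_terminated("PL-3", "demo-b")    # PL-3 done
print(first, middle, last)
```

Keeping the invocation ID next to the node's outstanding comps means a later callback for another node cannot overwrite it.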

2 Attachments

Related

Tickets: #2372
Wiki: ChangeLog-5.0.2
Wiki: ChangeLog-5.1.1

Discussion

  • Srikanth R

    Srikanth R - 2017-03-15

    Since the start of the CLM implementation, the service has not supported admin operations on more than one node simultaneously. There was a discussion (or ticket) in the earlier Trac ticket system stating that CLM doesn't support operations on two entities simultaneously.

    Below is the simple scenario to reproduce.

    -> Bring up CLM agent, and subscribe to the track callback. Do not respond to the START callback.

    -> Now perform CLM lock operation on the two payloads in two different terminals.

    -> In the CLM application, respond to the callbacks only after invoking both admin operations.

    -> Both admin operations result in the SA_AIS_ERR_REPAIR_PENDING return code. Judging from the syslog output below, it seems that CLM doesn't store the invocation ID for the initial admin operation.

    Mar 15 11:54:20 SLES-1 osafamfd[3276]: NO Pending Response sent for CLM track callback::OK '7'

     
  • Praveen

    Praveen - 2017-03-16

    Hi,

    Srikanth: Thanks for the information.

    I have analyzed the situation. The two issues are the same (in one case, AMF application comps are running on the locked payloads). The message "NO Pending Response sent for CLM track callback::OK '7'" appears because AMF responds twice for the same invocation ID. For the case mentioned in the ticket description, this message is not observed; the applications installed on the locked nodes make the difference. CLMS properly maintains the invocation ID for all clients per callback. So, to understand the problem, I considered a different case.

    Suppose one payload node, PL-4, is locked while an application has still not responded to the track callbacks, and another payload, PL-3, is stopped (OpenSAF stop). The application is hosted on PL-5 and its track flags are the same as AMFD's: (SA_TRACK_CURRENT | SA_TRACK_CHANGES_ONLY | SA_TRACK_VALIDATE_STEP | SA_TRACK_START_STEP).
    In this case, what is observed is that when PL-4 is locked, both AMF and the application get the track callback for CHANGE_START. AMF responds to the callback but the application does not. Now PL-3 is stopped. CLM delivers the track callback for the COMPLETED step, but it contains numberOfItems=2, covering both payloads PL-3 and PL-4. The application receives the same.
    The application never responds to the PL-4 callback, so the node lock timer expires at CLMD and it again sends the COMPLETED callback to both AMFD and the application. Since both AMFD and the application have registered for SA_TRACK_CHANGES_ONLY, I really doubt CLM should send a callback covering both PL-3 and PL-4. In the ticket description I pointed out this problem for the CHANGE_START case. From the CLM spec, section 3.5.2, SaClmClusterTrackCallbackT_4, page 51:

    The value of the numberOfItems attribute in the structure to which the
    notificationBuffer parameter points might be greater than the value of the
    numberOfMembers parameter if either the SA_TRACK_CHANGES flag or the
    SA_TRACK_CHANGES_ONLY flags is set, and one or more member nodes have left
    the cluster membership. In this case, the structure to which the
    notificationBuffer parameter points might contain information about the current
    members of the cluster and also about nodes that have recently left the cluster
    membership.
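
The quoted paragraph can be made concrete with a small model. This is a sketch under assumptions: `Item` and `summarize` are hypothetical, the change values are plain strings rather than the real enum, and the buffer is assumed to list every current member plus the nodes that recently left.

```python
# Sketch: why numberOfItems can exceed numberOfMembers.
# Item and summarize are hypothetical; change values are plain strings.
from dataclasses import dataclass


@dataclass
class Item:
    node: str
    change: str  # e.g. "NO_CHANGE", "JOINED", "LEFT"


def summarize(items):
    """Return (numberOfItems, numberOfMembers) for a modelled buffer."""
    number_of_items = len(items)
    # Nodes that have left are still listed but are no longer members.
    number_of_members = sum(1 for item in items if item.change != "LEFT")
    return number_of_items, number_of_members


buffer_items = [
    Item("PL-5", "NO_CHANGE"),
    Item("PL-3", "LEFT"),
    Item("PL-4", "LEFT"),
]
print(summarize(buffer_items))
```

In this model the buffer carries three items while only one node is still a member, so a receiver has to inspect each item's change field rather than assume every listed node is the one the callback is about.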

    I am going through the ticket list and the spec for more information regarding this.
    Thanks,
    Praveen

     
  • Praveen

    Praveen - 2017-03-23

    Hi,

    I think there is no problem from the CLM perspective. I have checked that in both of the cases above, initialViewNumber is passed correctly at all stages, and an application can always distinguish based on the passed initialViewNumber.
    So the fix is needed in AMF.
    I will send out a patch.

    Thanks,
    Praveen

     
  • Praveen

    Praveen - 2017-03-27
    • status: accepted --> review
     
  • Praveen

    Praveen - 2017-03-27
    • summary: amf/clm: CLM lock of two more nodes returns REPAIR_PENDING for first node. --> amf: CLM lock of two more nodes returns REPAIR_PENDING for first node.
     
  • Praveen

    Praveen - 2017-03-28
    • status: review --> fixed
     
