
#426 AMF: IMM returns ERR_TRY_AGAIN for saImmOiRtObjectUpdate() in an IMM initiated callback

4.3.3
fixed
None
defect
amf
-
4.2.1
major
2015-02-02
2013-05-31
No

Migrated from http://devel.opensaf.org/ticket/2821

Changeset: 3730
Redundancy model: NWAY

Through an SMF campaign, performed lock and lock-instantiation (lock-in) of an SU, then deleted its components and the SupportedCsTypes objects of class CompCSType.

Observed that saImmOiRtObjectUpdate of the attributes below is being performed by AMF, which should not happen as these are not runtime-modifiable:
1. saAmfCompNumCurrActiveCSIs
2. saAmfSURestartCount
3. saAmfSUNumCurrStandbySIs
4. saAmfSUNumCurrActiveSIs

/var/log/messages shows:

Sep 24 17:59:30 SLES11-SLOT-1 osafamfd[22351]: WA saImmOiRtObjectUpdate of 'safSupportedCsType=safVersion=4.0.0\,safCSType=NWAYCSBASETYPE_PI,safComp=COMP1SU5NWAYAPP,safSu=SU5,safSg=SGONE,safApp=NWAYAPP' saAmfCompNumCurrActiveCSIs failed with 6
Sep 24 17:59:30 SLES11-SLOT-1 osafamfd[22351]: WA saImmOiRtObjectUpdate of 'safSupportedCsType=safVersion=4.0.0\,safCSType=NWAYCSBASETYPE_PI,safComp=COMP2SU5NWAYAPP,safSu=SU5,safSg=SGONE,safApp=NWAYAPP' saAmfCompNumCurrActiveCSIs failed with 6
Sep 24 17:59:30 SLES11-SLOT-1 osafamfd[22351]: WA saImmOiRtObjectUpdate of 'safSupportedCsType=safVersion=4.0.0\,safCSType=NWAYCSBASETYPE_PI,safComp=COMP3SU5NWAYAPP,safSu=SU5,safSg=SGONE,safApp=NWAYAPP' saAmfCompNumCurrActiveCSIs failed with 6
Sep 24 17:59:30 SLES11-SLOT-1 osafamfd[22351]: WA saImmOiRtObjectUpdate of 'safComp=COMP1SU5NWAYAPP,safSu=SU5,safSg=SGONE,safApp=NWAYAPP' saAmfCompRestartCount failed with 6
Sep 24 17:59:30 SLES11-SLOT-1 osafamfd[22351]: WA saImmOiRtObjectUpdate of 'safComp=COMP1SU5NWAYAPP,safSu=SU5,safSg=SGONE,safApp=NWAYAPP' saAmfCompCurrProxyName failed with 6
Sep 24 17:59:30 SLES11-SLOT-1 osafamfd[22351]: WA saImmOiRtObjectUpdate of 'safComp=COMP2SU5NWAYAPP,safSu=SU5,safSg=SGONE,safApp=NWAYAPP' saAmfCompRestartCount failed with 6
Sep 24 17:59:30 SLES11-SLOT-1 osafamfd[22351]: WA saImmOiRtObjectUpdate of 'safComp=COMP2SU5NWAYAPP,safSu=SU5,safSg=SGONE,safApp=NWAYAPP' saAmfCompCurrProxyName failed with 6
Sep 24 17:59:30 SLES11-SLOT-1 osafamfd[22351]: WA saImmOiRtObjectUpdate of 'safComp=COMP3SU5NWAYAPP,safSu=SU5,safSg=SGONE,safApp=NWAYAPP' saAmfCompRestartCount failed with 6
Sep 24 17:59:30 SLES11-SLOT-1 osafamfd[22351]: WA saImmOiRtObjectUpdate of 'safComp=COMP3SU5NWAYAPP,safSu=SU5,safSg=SGONE,safApp=NWAYAPP' saAmfCompCurrProxyName failed with 6
Sep 24 17:59:30 SLES11-SLOT-1 osafamfd[22351]: WA saImmOiRtObjectUpdate of 'safSu=SU5,safSg=SGONE,safApp=NWAYAPP' saAmfSURestartCount failed with 6
Sep 24 17:59:30 SLES11-SLOT-1 osafamfd[22351]: WA saImmOiRtObjectUpdate of 'safSu=SU5,safSg=SGONE,safApp=NWAYAPP' saAmfSUNumCurrStandbySIs failed with 6
Sep 24 17:59:30 SLES11-SLOT-1 osafamfd[22351]: WA saImmOiRtObjectUpdate of 'safSu=SU5,safSg=SGONE,safApp=NWAYAPP' saAmfSUNumCurrActiveSIs failed with 6
Sep 24 17:59:30 SLES11-SLOT-1 osafimmnd[22297]: NO Ccb 140 COMMITTED (SMFSERVICE)
Sep 24 17:59:30 SLES11-SLOT-1 osafamfd[22351]: ER job_exec_imm_objupdate: update FAILED 12
Sep 24 17:59:30 SLES11-SLOT-1 osafamfd[22351]: ER job_exec_imm_objupdate: update FAILED 12
Sep 24 17:59:30 SLES11-SLOT-1 osafimmnd[22297]: NO Create of PERSISTENT runtime object 'smfRollbackData=00000001,smfRollbackElement=ccb_00000009,smfRollbackElement=ProcWrapup,safSmfProc=amfClusterProc-1,safSmfCampaign=Campaign,safApp=safSmfService' by Impl safSmfCampaign=Campaign,safApp=safSmfService
Sep 24 17:59:30 SLES11-SLOT-1 osafamfd[22351]: ER job_exec_imm_objupdate: update FAILED 12

Changed 8 months ago by hafe
■version changed from 4.2.2 to 4.2.1
Kind of a duplicate of http://devel.opensaf.org/ticket/2227

Changed 8 months ago by hafe
Could you please rerun the test case with amfd trace on?

Changed 8 months ago by hrishikesh.chenna
■attachment osafamfd.tgz added
Attached the amfd trace file for the same timestamp as in the ticket log snippet.

Changed 8 months ago by hafe
■summary changed from Non modifiable runtime attributes are being modified by AMF and returns ERR_TRY_AGAIN. to IMM returns ERR_TRY_AGAIN for saImmOiRtObjectUpdate() in an IMM initiated callback
The AMF model contains classes with pure runtime attributes. What is shown in the trace is just a normal read of such an object, which results in a callback to avd to return values to IMM. Thus the ticket summary was wrong and has now been changed.

The problem here is that IMM responds with TRY_AGAIN in the context of an IMM callback! The complete syslog and immnd trace are not attached, so there is not much more that can be done. It would help if they were provided.

However, this log:

Sep 24 17:59:30 SLES11-SLOT-1 osafamfd[22351]: ER job_exec_imm_objupdate: update FAILED 12

could be avoided, since it means that a deferred object update (queued due to an earlier TRY_AGAIN) fails because the object no longer exists in IMM. At object deletion, avd could walk the deferred IMM job list and prune the jobs that are affected.
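
A minimal sketch of that pruning idea, assuming a hypothetical singly linked deferred-job list (the real AMFD job queue structures differ; only SaNameT comes from the SAF headers):

    #include <stdlib.h>
    #include <string.h>
    #include <saAis.h>

    /* Hypothetical node describing one deferred rt-object update. */
    struct ImmUpdateJob {
            SaNameT object_name;               /* object the deferred update targets */
            struct ImmUpdateJob *next;
    };

    static struct ImmUpdateJob *job_list_head;  /* hypothetical deferred job list */

    /* On object deletion, unlink and free every queued update that refers
     * to the deleted object, so it can never fail later with ERR_NOT_EXIST
     * and trigger the ER log shown above. */
    static void prune_jobs_for_deleted_object(const SaNameT *deleted)
    {
            struct ImmUpdateJob **pp = &job_list_head;

            while (*pp != NULL) {
                    struct ImmUpdateJob *job = *pp;
                    if (job->object_name.length == deleted->length &&
                        memcmp(job->object_name.value, deleted->value,
                               deleted->length) == 0) {
                            *pp = job->next;    /* unlink and discard the stale job */
                            free(job);
                    } else {
                            pp = &job->next;
                    }
            }
    }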

Changed 8 months ago by anders
Severity of:

Sep 24 17:59:30 SLES11-SLOT-1 osafamfd[22351]: ER job_exec_imm_objupdate: update FAILED 12

should not be ER unless AMFD is terminating due to this, which would seem excessive.

Severity of this:

Sep 24 17:59:30 SLES11-SLOT-1 osafamfd[22351]: WA saImmOiRtObjectUpdate of 'safComp=COMP3SU5NWAYAPP,safSu=SU5,safSg=SGONE,safApp=NWAYAPP' saAmfCompCurrProxyName failed with 6

should not be WA and should not even be logged, since the error is simply TRY_AGAIN.

When/if the service/OI has to give up its retry attempts, due to its
realtime behavior requirements and duties towards other tasks, then
it should simply give up and return ERR_NO_RESOURCES on the callback.
(Ideally it would return ERR_TRY_AGAIN, but that is not allowed according to
the spec.)
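
A minimal sketch of that give-up pattern, with a bounded retry loop around saImmOiRtObjectUpdate_2() (the retry count and back-off values are illustrative assumptions, not the actual AMFD code):

    #include <unistd.h>
    #include <saImmOi.h>

    /* Retry the rt update a bounded number of times on TRY_AGAIN, then give
     * up and report NO_RESOURCES so the OI callback can be failed instead of
     * blocking the OI thread indefinitely. */
    static SaAisErrorT rt_update_with_bounded_retry(SaImmOiHandleT oi_handle,
                                                    const SaNameT *object,
                                                    const SaImmAttrModificationT_2 **mods)
    {
            SaAisErrorT rc;
            int attempts = 0;

            do {
                    rc = saImmOiRtObjectUpdate_2(oi_handle, object, mods);
                    if (rc != SA_AIS_ERR_TRY_AGAIN)
                            break;
                    usleep(100 * 1000);  /* short back-off between attempts */
            } while (++attempts < 5);

            /* The spec does not allow TRY_AGAIN as a callback result, so map
             * a final TRY_AGAIN to NO_RESOURCES as suggested above. */
            return (rc == SA_AIS_ERR_TRY_AGAIN) ? SA_AIS_ERR_NO_RESOURCES : rc;
    }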

Changed 8 months ago by anders
■milestone changed from future_releases to 4.2.3
Changed 8 months ago by anders
The description for this ticket suggests that the local IMMND has crashed and restarted.
If that is the case, then that should of course be investigated.
If it is not a known and fixed problem, then a ticket should be written.

It could I suppose be an effect of a node being shut down, as part of the SMF campaign.

That would be covered by http://devel.opensaf.org/ticket/2099

Changed 8 months ago by hafe
Replying to anders:

The description for this ticket suggests that the local IMMND has crashed and restarted.
If that is the case, then that should of course be investigated.
If it is not a known and fixed problem, then a ticket should be written.

It could I suppose be an effect of a node being shut down, as part of the SMF campaign.

That would be covered by http://devel.opensaf.org/ticket/2099

From the traces I can see that IMMND has not crashed. It is alive and executing a CCB where objects are deleted. What is strange is that an object delete is followed by a read of that same object while the CCB is still not applied. The object contains pure runtime attributes, so AMFD's SaImmOiRtAttrUpdateCallbackT is invoked. The callback tries to return attribute values to IMMND with saImmOiRtObjectUpdate() but receives TRY_AGAIN. This is logged at Warning level since it is perceived as strange. The TRY_AGAIN causes the update to be put into AMFD's job queue. Later, when the job is processed, the object no longer exists (for real), but a logging bug in job_exec_imm_objupdate() causes this to be logged at Error level.
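
A sketch of how the deferred execution described above could treat a deleted object as benign (hypothetical helper and types, not the real job_exec_imm_objupdate()):

    #include <syslog.h>
    #include <saImmOi.h>

    /* Hypothetical descriptor for a deferred rt update parked in the job queue. */
    struct DeferredRtUpdate {
            SaImmOiHandleT oi_handle;
            SaNameT object_name;
            const SaImmAttrModificationT_2 **attr_mods;
    };

    /* Execute one deferred update.  ERR_NOT_EXIST just means the object was
     * deleted while the job waited, so drop it quietly; only unexpected
     * errors deserve Error-level logging. */
    static void exec_deferred_rt_update(struct DeferredRtUpdate *job)
    {
            SaAisErrorT rc = saImmOiRtObjectUpdate_2(job->oi_handle,
                                                     &job->object_name,
                                                     job->attr_mods);
            switch (rc) {
            case SA_AIS_OK:
            case SA_AIS_ERR_NOT_EXIST:      /* object gone: not an error */
                    break;
            case SA_AIS_ERR_TRY_AGAIN:
                    /* leave the job queued for the next attempt (not shown) */
                    break;
            default:
                    syslog(LOG_ERR, "deferred rt object update failed: %d", (int)rc);
                    break;
            }
    }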

So my conclusion:
1) AMFD's logging could be changed and corrected (but that would only hide this)

2) IMMND is returning TRY_AGAIN while at the same time calling an implementer's SaImmOiRtAttrUpdateCallbackT, which is weird. Why bother calling the implementer if it is not allowed to reply anyway?

3) Most likely SMF is constantly reading deleted (CCB not yet applied) objects! Why is that? If the read is really necessary, it is probably only interested in config attributes anyway.

Would it be possible to get SMFD and IMMND traces also?

Changed 5 months ago by anders
Went back to look at this unsolved ticket because it has similarities with #2922.
It could be (it's likely) that the TRY_AGAIN from IMMND on the rt update is due to the active IMMD being down.

Unfortunately, the only info attached to this ticket is the osafamfd trace.
But it appears that the AMFD gets stuck in its progress of performing the
switchover. The key question is whether the AMFD can get blocked on IMM requests
during the switchover. My understanding was that the AMFD, when getting TRY_AGAIN
on such a job, would park the imm-rt-update in the AMFD's job queue, so it should
not get stuck and the switchover should complete.
Perhaps that is indeed the case.
It is not totally clear what problem this ticket is reporting.

The ticket summary complains that an IMM downcall inside an IMM callback gets TRY_AGAIN
from the IMM. But that is the way it is: the cluster state may change in the time
between the IMMND sending the callback request to the IMMA client and the IMMA client
acting on that callback. This is a possibility and not an error.
The implementer of the callback could return an error on the callback as a way
to get rid of that job.

Changed 4 months ago by hafe
■milestone changed from 4.2.3 to future_releases
Removing milestone; the ticket is not accepted and thus not scheduled for 4.2.3.


Related

Tickets: #426
Wiki: ChangeLog-4.3.3
Wiki: ChangeLog-4.4.1

Discussion

  • Nagendra Kumar

    Nagendra Kumar - 2013-09-06
    • status: unassigned --> assigned
    • assigned_to: Nagendra Kumar
    • Milestone: future --> 4.4.FC
     
  • Nagendra Kumar

    Nagendra Kumar - 2013-12-02
    • assigned_to: Nagendra Kumar --> nobody
    • Milestone: 4.4.FC --> 4.2.5
     
  • Anders Bjornerstedt

    • summary: IMM returns ERR_TRY_AGAIN for saImmOiRtObjectUpdate() in an IMM initiated callback --> AMF: IMM returns ERR_TRY_AGAIN for saImmOiRtObjectUpdate() in an IMM initiated callback
     
  • Anders Bjornerstedt

    This is a very strange and complicated case and the information (syslogs)
    provided is incomplete.

    I suggest that the AMF gets out of this situation by abandoning the
    attempt to update the pure RTA. Instead it would just return
    ERR_NO_RESOURCES on the rtUpdate callback.

    The only function compromised by that would be some read performed by
    some OM client; it would get an error on that read.
    If (as I suspect) the root cause is an ongoing fail-over or switch-over
    and the unavailability of the IMMD (no functioning active director),
    then the case should be rare. A user getting an error on a search request
    during a failover/switchover should not result in anything catastrophic.

     
  • Anders Bjornerstedt

    • status: assigned --> unassigned
     
  • Praveen

    Praveen - 2014-02-17
    • Milestone: 4.2.5 --> 4.3.3
     
  • Nagendra Kumar

    Nagendra Kumar - 2014-06-16
    • status: unassigned --> assigned
    • assigned_to: Nagendra Kumar
     
  • Nagendra Kumar

    Nagendra Kumar - 2014-07-31

    I could reproduce it by adding a sleep in compcstype_ccb_completed_cb after CCBUTIL_DELETE:

        case CCBUTIL_DELETE:
                TRACE_ENTER2("Before nag CCB ID ");
                sleep(4);

    The following commands were run:
    immcfg -f /tmp/AppConfig-2N.xml
    From one terminal: immcfg -d safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1;
    From another terminal: immlist safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1;

    Jul 31 12:10:10 PM_SC-180 osafamfd[31053]: WA saImmOiRtObjectUpdate of 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' saAmfSURestartCount failed with 6
    Jul 31 12:10:10 PM_SC-180 osafimmnd[30996]: NO Ccb 4 COMMITTED (immcfg_PM_SC-180_31433)
    Jul 31 12:10:10 PM_SC-180 osafamfd[31053]: WA saImmOiRtObjectUpdate of 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' saAmfSUNumCurrStandbySIs failed with 12
    Jul 31 12:10:10 PM_SC-180 osafamfd[31053]: WA saImmOiRtObjectUpdate of 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' saAmfSUNumCurrActiveSIs failed with 12
    Jul 31 12:10:10 PM_SC-180 osafimmnd[30996]: ER Internal IMM server problem - failure from internal searchInit: 12

    Also:
    immcfg -d "safSupportedCsType=safVersion=1\,safC SType=AmfDemo1,safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1"
    immlist "safSupportedCsType=safVersion=1\,safCSType=AmfDemo1,safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1"

    Jul 31 11:48:04 PM_SC-180 osafimmnd[25894]: NO Ccb 2 COMMITTED (immcfg_PM_SC-180_26142)
    Jul 31 11:48:09 PM_SC-180 osafamfd[25952]: WA saImmOiRtObjectUpdate of 'safSupportedCsType=safVersion=1\,safCSType=AmfDemo1,safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' saAmfCompNumCurrActiveCSIs failed with 6
    Jul 31 11:48:09 PM_SC-180 osafimmnd[25894]: NO Ccb 3 COMMITTED (immcfg_PM_SC-180_26146)
    Jul 31 11:48:09 PM_SC-180 osafimmnd[25894]: ER Internal IMM server problem - failure from internal searchInit: 12

    So, the fix as suggested by Anders Bj could be to return SA_AIS_ERR_FAILED_OPERATION from compcstype_rt_attr_callback when avd_saImmOiRtObjectUpdate_sync fails. Anyway, the object is being deleted, so the attribute update would not be visible to anybody.
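
    An illustrative sketch of that direction (the handler and helper names here are placeholders, not the committed change referenced below):

        #include <saImmOi.h>

        /* Hypothetical helper that builds the modification list for the
         * requested runtime attributes; the real AMFD code does this inline. */
        const SaImmAttrModificationT_2 **build_rt_attr_mods(const SaNameT *object,
                                                            const SaImmAttrNameT *names);

        /* SaImmOiRtAttrUpdateCallbackT-style handler: if the rt object update
         * fails (e.g. the object is being deleted), fail the whole callback
         * with FAILED_OPERATION instead of retrying against a dying object. */
        static SaAisErrorT comp_cs_type_rt_attr_cb(SaImmOiHandleT immOiHandle,
                                                   const SaNameT *objectName,
                                                   const SaImmAttrNameT *attributeNames)
        {
                const SaImmAttrModificationT_2 **mods =
                        build_rt_attr_mods(objectName, attributeNames);

                if (mods == NULL ||
                    saImmOiRtObjectUpdate_2(immOiHandle, objectName, mods) != SA_AIS_OK)
                        return SA_AIS_ERR_FAILED_OPERATION;

                return SA_AIS_OK;
        }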

     
  • Nagendra Kumar

    Nagendra Kumar - 2014-07-31
    • status: assigned --> review
     
  • Nagendra Kumar

    Nagendra Kumar - 2014-08-12

    changeset: 5562:a4124a505df9
    tag: tip
    user: Nagendra Kumar <nagendra.k@oracle.com>
    date: Tue Aug 12 20:43:13 2014 +0530
    summary: amfd: return FAILED_OP to RtAttrUpdateCallbackT if RtObjectUpdate_2 fails [#426]

    [staging:a4124a]

     

    Related

    Tickets: #426
    Commit: [a4124a]

  • Nagendra Kumar

    Nagendra Kumar - 2014-08-21
    • status: review --> fixed
     
  • Nagendra Kumar

    Nagendra Kumar - 2014-08-21

    changeset: 5628:6d4c5e78b31f
    branch: opensaf-4.3.x
    user: Nagendra Kumar <nagendra.k@oracle.com>
    date: Thu Aug 21 11:42:11 2014 +0530
    summary: amfd: return FAILED_OP to RtAttrUpdateCallbackT if RtObjectUpdate_2 fails [#426]

    changeset: 5629:caabfba704e8
    branch: opensaf-4.4.x
    tag: tip
    parent: 5626:ae283908b0d5
    user: Nagendra Kumar <nagendra.k@oracle.com>
    date: Thu Aug 21 11:42:39 2014 +0530
    summary: amfd: return FAILED_OP to RtAttrUpdateCallbackT if RtObjectUpdate_2 fails [#426]

    [staging:6d4c5e]
    [staging:caabfb]

     

    Related

    Tickets: #426
    Commit: [6d4c5e]
    Commit: [caabfb]

