Menu

#2188 amfd: avd_imm_impl_set fails causing node reboot

5.0.2
fixed
nobody
None
defect
amf
d
major
2016-12-05
2016-11-15
Gary Lee
No

avd_imm_impl_set fails causing node reboot

It seems there may have been simultaneous IMM reinit threads running.

Nov 14 02:24:26 SC-2-2 osafamfd[4174]: NO Re-initializing with IMM
Nov 14 02:24:26 SC-2-2 osafamfd[4174]: NO Re-initializing with IMM
Nov 14 02:24:26 SC-2-2 osafimmnd[16412]: NO Implementer connected: 44 (safAmfService) <526, 2020f>
Nov 14 02:24:26 SC-2-2 osafamfd[4174]: NO Finished re-initializing with IMM
Nov 14 02:24:26 SC-2-2 osafamfd[4174]: ER saImmOiImplementerSet failed 14
Nov 14 02:24:26 SC-2-2 osafamfd[4174]: ER exiting since avd_imm_impl_set failed
Nov 14 02:24:26 SC-2-2 osafimmnd[16412]: NO Implementer locally disconnected. Marking it as doomed 44 <526, 2020f> (safAmfService)
Nov 14 02:24:26 SC-2-2 osafamfnd[4192]: WA AMF director unexpectedly crashed
Nov 14 02:24:26 SC-2-2 osafamfnd[4192]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received, OwnNodeId = 131599, SupervisionTime = 60
Nov 14 02:24:26 SC-2-2 osafimmnd[16412]: NO Implementer disconnected 44 <526, 2020f> (safAmfService)

1 Attachments

Related

Tickets: #2101
Tickets: #2188
Wiki: ChangeLog-5.0.2
Wiki: ChangeLog-5.1.1

Discussion

1 2 > >> (Page 1 of 2)
  • Gary Lee

    Gary Lee - 2016-11-15
    • Attachments has changed:

    Diff:

    --- old
    +++ new
    @@ -0,0 +1 @@
    +amfd-core.txt (9.9 kB; text/plain)
    
     
  • Gary Lee

    Gary Lee - 2016-11-15

    backtrace

     
  • Praveen

    Praveen - 2016-11-15

    Hi Gary,

    In which scenario it is observed and please provide the traces.
    Is this reproducible?

    Thanks,
    Praveen

     
  • Gary Lee

    Gary Lee - 2016-11-15

    Hi Praveen

    It's rare but has occurred a few times. I've asked a tester to try and reproduce it with traces. I believe the test kills immnd causing BAD_HANDLE to be returned to AMF etc..

     

    Last edit: Gary Lee 2016-11-15
  • Praveen

    Praveen - 2016-11-16

    Hi Gary,
    From amfd-core.txt, it seems that avd_imm_reinit_bg() was called two times. This can happen when IMM returns BAD_HANDLE and AMFD calls avd_imm_reinit_bg() to init with in IMM in background. After this if again main thread tries to execute some job for IMM update, IMM will return BAD_HANDLE and AMFD will again call avd_imm_reinit_bg() to init with IMM.
    Attached is a patch that will avoid multiple calls to avd_imm_reinit_bg().

    Thanks,
    Praveen

     
  • Gary Lee

    Gary Lee - 2016-11-18

    Hi Praveen

    I think there might be problems with threads concurrently accessing avd_imm_status as it is not protected by a mutex. There are already some mutexes in avd_imm_reinit_bg_thread(). Are they supposed to prevent the original problem?

     
  • Gary Lee

    Gary Lee - 2016-11-18

    I'm not sure about this in avd_imm_reinit_bg():

    `(void) saImmOiFinalize(avd_cb->immOiHandle);
    
    avd_cb->immOiHandle = 0;
    avd_cb->is_implementer = false;`
    

    It looks like there could be a race with the same code in avd_imm_reinit_bg_thread(). And saImmOiFinalize(0) is called, which could explain "saImmOiImplementerSet failed 14".

     

    Last edit: Gary Lee 2016-11-18
  • Praveen

    Praveen - 2016-11-18

    Hi Gary,

    In the patch #2188, cb->avd_imm_status in always accesses in main thread, so I think no protection is needed for it.
    Also the check added in avd_imm_reinit_bg_thread() will prevent multiple threads running. avd_imm_reinit_bg_thread() is called from main thread only.
    Thanks,
    Praveen

     
  • Gary Lee

    Gary Lee - 2016-11-18

    Hi Praveen

    Ah I see. Does that mean we can remove the mutexes?

    Thanks

     
  • Praveen

    Praveen - 2016-11-18

    Hi Gary,
    The mutexes are used in role.cc for IMM related handling during failovers and switchovers.
    I think removal cannot be done so easily. But some refactoring can always be done based on approach taken for this ticket.

    Thanks,
    Praveen

     
  • Praveen

    Praveen - 2016-11-18

    Crash reported in the ticket "#2101 AMF: Heartbeat timeout observed after ImmNd restart" looks very much similar to this ticket.

     
  • Minh Hon Chau

    Minh Hon Chau - 2016-11-18

    With helps from Hung, I have seen 2 concurrent IMM reinit threads by adding delay in immpbe, log/trace is attached.

    Below is snippet

    2016-11-18 23:13:24 SC-1 osafamfd[488]: NO Re-initializing with IMM
    2016-11-18 23:13:24 SC-1 osafimmnd[432]: WA IMMND - Client Node Get Failed for cli_hdl 115964248335
    2016-11-18 23:13:24 SC-1 osafimmnd[432]: NO Implementer connected: 12 (safAmfService) <389, 2010f>
    2016-11-18 23:13:24 SC-1 osafamfd[488]: NO Re-initializing with IMM
    2016-11-18 23:13:24 SC-1 osafamfd[488]: NO Finished re-initializing with IMM
    2016-11-18 23:13:24 SC-1 osafimmnd[432]: NO Implementer locally disconnected. Marking it as doomed 12 <389, 2010f> (safAmfService)
    2016-11-18 23:13:24 SC-1 osafamfd[488]: ER exec: update FAILED 2
    2016-11-18 23:13:24 SC-1 osafamfd[488]: WA saImmOiRtObjectUpdate of 'safSg=AmfDemoTwon,safApp=AmfDemoTwon' osafAmfSGSuOperationList failed with 2
    2016-11-18 23:13:24 SC-1 osafamfd[488]: WA saImmOiRtObjectUpdate of 'safSg=AmfDemoTwon,safApp=AmfDemoTwon' osafAmfSGFsmState failed with 2
    2016-11-18 23:13:24 SC-1 osafimmnd[432]: NO Implementer disconnected 12 <389, 2010f> (safAmfService)
    2016-11-18 23:13:24 SC-1 osafimmnd[432]: NO Implementer connected: 13 (safAmfService) <390, 2010f>
    2016-11-18 23:13:24 SC-1 osafamfd[488]: ER ERR_BAD_OPERATION: The SaImmOiHandleT is not associated with any implementer name
    2016-11-18 23:13:24 SC-1 osafamfd[488]: WA saImmOiRtObjectUpdate of 'safSg=AmfDemoTwon,safApp=AmfDemoTwon' osafAmfSGSuOperationList failed with 20
    2016-11-18 23:13:24 SC-1 osafamfd[488]: ER ERR_BAD_OPERATION: The SaImmOiHandleT is not associated with any implementer name
    2016-11-18 23:13:24 SC-1 osafamfd[488]: WA saImmOiRtObjectUpdate of 'safSg=AmfDemoTwon,safApp=AmfDemoTwon' osafAmfSGFsmState failed with 20
    2016-11-18 23:13:24 SC-1 osafamfd[488]: ER exec: update FAILED 2
    2016-11-18 23:13:24 SC-1 osafimmnd[432]: WA Timeout on syncronous admin operation 1
    2016-11-18 23:13:24 SC-1 osafamfd[488]: NO Finished re-initializing with IMM

    In another run, I have seen "saImmOiImplementerSet failed 14", but logs were rotated
    Below is log left from console
    2016-11-18 22:37:27 SC-1 osafamfd[487]: NO Re-initializing with IMM
    2016-11-18 22:37:27 SC-1 osafimmnd[431]: WA IMMND - Client Node Get Failed for cli_hdl 115964248335
    2016-11-18 22:37:27 SC-1 osafimmnd[431]: NO Implementer connected: 12 (safAmfService) <471, 2010f>
    2016-11-18 22:37:27 SC-1 osafamfd[487]: NO Re-initializing with IMM
    2016-11-18 22:37:27 SC-1 osafamfd[487]: NO Finished re-initializing with IMM
    2016-11-18 22:37:27 SC-1 osafamfd[487]: ER exec: update FAILED 2
    2016-11-18 22:37:27 SC-1 osafimmnd[431]: NO Implementer locally disconnected. Marking it as doomed 12 <471, 2010f> (safAmfService)
    2016-11-18 22:37:27 SC-1 osafamfd[487]: NO Re-initializing with IMM
    2016-11-18 22:37:27 SC-1 osafimmnd[431]: NO Implementer disconnected 12 <471, 2010f> (safAmfService)
    2016-11-18 22:37:27 SC-1 osafimmnd[431]: NO Implementer connected: 13 (safAmfService) <472, 2010f>
    2016-11-18 22:37:27 SC-1 osafimmnd[431]: WA Timeout on syncronous admin operation 1
    2016-11-18 22:37:27 SC-1 osafamfd[487]: NO Finished re-initializing with IMM
    2016-11-18 22:37:27 SC-1 osafamfd[487]: ER saImmOiImplementerSet failed 14
    2016-11-18 22:37:27 SC-1 osafamfd[487]: ER exiting since avd_imm_impl_set failed
    2016-11-18 22:37:27 SC-1 osafamfnd[500]: WA AMF director unexpectedly crashed
    2016-11-18 22:37:27 SC-1 osafamfnd[500]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received, OwnNodeId = 131343, SupervisionTime = 60

     
  • Minh Hon Chau

    Minh Hon Chau - 2016-11-21

    Hi Praveen,

    Thanks for your patch, I think it could solve the problem of 2 concurrent init threads.
    However, I have seen another problem when testing the patch, that it probably needs to handle:
    During re-initializing IMM, the imm update will be lost:
    - if it calls from avd_saImmOiRtObjectUpdate_sync()
    - if it is a job that has IMM returned a code other than the ones are being handled in ImmObjUpdate::exec() , ie. SA_AIS_ERR_BAD_OPERATION

    Thanks,
    Minh

     
  • Gary Lee

    Gary Lee - 2016-11-21

    Agree with Minh, if avd_saImmOiRtObjectUpdate_sync() fails with BAD_HANDLE, we should call avd_saImmOiRtObjectUpdate() and queue the job as last resort.

    The following in avd_imm_reinit_bg() should be removed:

    `(void) saImmOiFinalize(avd_cb->immOiHandle);

    avd_cb->immOiHandle = 0;
    avd_cb->is_implementer = false;`

     
  • Praveen

    Praveen - 2016-11-21

    Hi Minh,

    I have incorporated all the comments and published officially.
    Please review and test it.

    Thanks,
    Praveen

     
  • Praveen

    Praveen - 2016-11-21
    • status: unassigned --> review
    • assigned_to: Praveen
    • Milestone: 5.1.1 --> 5.0.2
     
  • Minh Hon Chau

    Minh Hon Chau - 2016-11-22

    attached trace tested with lastest patch

     
  • Praveen

    Praveen - 2016-11-25
    • status: review --> fixed
     
  • Minh Hon Chau

    Minh Hon Chau - 2016-11-28

    Got a coredump

    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
    Core was generated by `/usr/local/lib/opensaf/osafamfd'.
    Program terminated with signal SIGSEGV, Segmentation fault.
    #0  search (pTree=pTree@entry=0x7f612ca51440 <imma_cb+160>, key=key@entry=0x7ffc9d42b7d8 "\017\002\002") at patricia.c:93
    
    Thread 5 (Thread 0x7f612a799700 (LWP 880)):
    #0  0x00007f612b4cdfdd in poll () at ../sysdeps/unix/syscall-template.S:81
    No locals.
    #1  0x00007f612cea78a1 in poll (__timeout=30000, __nfds=1, __fds=0x7f612a798a30) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
    No locals.
    #2  osaf_ppoll (io_fds=io_fds@entry=0x7f612a798a30, i_nfds=i_nfds@entry=1, i_timeout_ts=i_timeout_ts@entry=0x7f612a798a00, i_sigmask=i_sigmask@entry=0x0) at osaf_poll.c:105
            current_time = {tv_sec = 0, tv_nsec = 565213401186792}
            elapsed_time = {tv_sec = 0, tv_nsec = 0}
            time_left = 30000
            start_time = {tv_sec = 1483636, tv_nsec = 247331146}
            time_left_ts = <optimized out>
            result = 7237888
    #3  0x00007f612cea7a3b in osaf_poll (io_fds=io_fds@entry=0x7f612a798a30, i_nfds=i_nfds@entry=1, i_timeout=i_timeout@entry=30000) at osaf_poll.c:44
            timeout_ts = {tv_sec = 30, tv_nsec = 0}
    #4  0x00007f612cea7a85 in osaf_poll_one_fd (i_fd=19, i_timeout=i_timeout@entry=30000) at osaf_poll.c:128
            set = {fd = 19, events = 1, revents = 0}
            result = <optimized out>
    #5  0x00007f612c82ddac in imma_sync_with_immnd (cb=0x7f612ca513a0 <imma_cb>) at imma_init.c:82
    No locals.
    #6  imma_create (sv_id=NCSMDS_SVC_ID_IMMA_OI) at imma_init.c:168
            value = <optimized out>
            cb = 0x7f612ca513a0 <imma_cb>
            rc = <optimized out>
    #7  imma_startup (sv_id=NCSMDS_SVC_ID_IMMA_OI) at imma_init.c:289
            rc = 1
            pt_err = <optimized out>
            __FUNCTION__ = "imma_startup"
    #8  0x00007f612c8238db in initialize_common (immOiHandle=immOiHandle@entry=0x6eab78 <_control_block+632>, inout_version=inout_version@entry=0x7f612a798e30, client_node=client_node@entry=0x7f612a798e08) at imma_oi_api.c:129
            rc = SA_AIS_OK
            init_evt = {next = 0x5, type = 0, info = {imma = {type = IMMA_EVT_MDS_INFO, info = {initRsp = {immHandle = 140055346857416, error = SA_AIS_OK}, errRsp = {error = 758302152, errStrings = 0x7f6100000001}, admInitRsp = {error = 758302152, ownerId = 32609}, ccbInitRsp = {error = 758302152, ccbId = 32609}, searchInitRsp = {error = 758302152, searchId = 32609}, searchNextRsp = 0x7f612d32c5c8, searchBundleNextRsp = 0x7f612d32c5c8, searchRemote = {client_hdl = 140055346857416, requestNodeId = 1, remoteNodeId = 32609, searchId = 758277344, objectName = {size = 0, buf = 0x7f612d326838 "\200\364\062-a\177"}, attributeNames = 0x0}, admOpReq = {adminOwnerId = 758302152, invocation = 32609, operationId = 140054588555265, continuationId = 140055346832608, timeout = 0, objectName = {size = 758278200, buf = 0x0}, params = 0x7f612d32c920}, admOpRsp = {oi_client_hdl = 140055346857416, invocation = 140054588555265, result = 758277344, error = 32609, parms = 0x0}, objCreate = {ccbId = 758302152, adminOwnerId = 32609, className = {size = 1, buf = 0x7f612d3264e0 ""}, parentOrObjectDn = {size = 0, buf = 0x7f612d326838 "\200\364\062-a\177"}, attrValues = 0x0, immHandle = 140055346858272}, objDelete = {ccbId = 758302152, adminOwnerId = 32609, objectName = {size = 1, buf = 0x7f612d3264e0 ""}, immHandle = 0}, objModify = {ccbId = 758302152, adminOwnerId = 32609, objectName = {size = 1, buf = 0x7f612d3264e0 ""}, attrMods = 0x0, immHandle = 140055346833464}, ccbCompl = {ccbId = 758302152, implId = 32609, invocation = 1, immHandle = 140055346832608}, classDescr = {className = {size = 758302152, buf = 0x7f6100000001 <error: Cannot access memory at address 0x7f6100000001>}, classCategory = 758277344, attrDefinitions = 0x0}, implSetRsp = {error = 758302152, implId = 32609}, tmr_info = {type = 758302152, adm_owner_hdl = 140054588555265, client_hdl = 140055346832608, invocation = 0}}}, immnd = {dont_free_me = true, unused1 = false, unused2 = false, unused3 = false, error = 0, type = 758302152, info = {initReq = {version = {releaseCode = 1 '\001', majorVersion = 0 '\000', minorVersion = 0 '\000'}, client_pid = 32609}, finReq = {client_hdl = 140054588555265}, adminitReq = {client_hdl = 140054588555265, i = {adminOwnerName = {_opaque = {25824, 11570, 32609, 0, 0, 0, 0, 0, 26680, 11570, 32609, 0, 0, 0, 0, 0, 51488, 11570, 32609, 0, 35632, 10873, 32609, 0, 35616, 10873, 32609, 0, 64589, 28266, 0, 0, 25263, 11497, 32609, 0, 65535, 65535, 0, 0, 5114, 11394, 32609, 0, 8696, 11497, 32609, 0, 50632, 11570, 32609, 0, 18944, 11497, 32609, 0, 50632, 11570, 32609, 0, 1, 0, 0, 0, 24344, 11497, 32609, 0, 35824, 10873, 32609, 0, 32320, 11536, 32609, 0, 0, 0, 0, 0, 36008, 10873, 32609, 0, 34576, 11396, 32609, 0, 4304, 11429, 32609, 0, 0, 0, 0, 0, 32824, 11536, 32609, 0, 0, 0, 0, 0, 36008, 10873, 32609, 0, 43896, 110, 0, 0, 43912, 110, 0, 0, 27, 0, 0, 0, 42279, 11537, 32609, 0, 35741, 11396, 32609, 0, 0}}, releaseOwnershipOnFinalize = SA_FALSE}}, ccbinitReq = {adminOwnerId = 1, ccbFlags = 140055346832608, client_hdl = 0}, implSet = {client_hdl = 140054588555265, impl_name = {size = 758277344, buf = 0x0}, impl_id = 758278200, scope = 32609, oi_timeout = 0}, admFinReq = {adm_owner_id = 1}, admReq = {adm_owner_id = 1, scope = 32609, objectNames = 0x7f612d3264e0}, admOpReq = {adminOwnerId = 1, invocation = 32609, operationId = 140055346832608, continuationId = 0, timeout = 140055346833464, objectName = {size = 0, buf = 0x7f612d32c920 "\200\364\062-a\177"}, params = 0x7f612a798b30}, fevsReq = {sender_count = 140054588555265, reply_dest = 140055346832608, client_hdl = 0, msg = {size = 758278200, buf = 0x0}, isObjSync = 32 ' '}, admOpRsp = {oi_client_hdl = 140054588555265, invocation = 140055346832608, result = 0, error = 0, parms = 0x7f612d326838}, ccbUpcallRsp = {oi_client_hdl = 140054588555265, ccbId = 758277344, implId = 32609, inv = 0, result = 0, name = {_opaque = {26680, 11570, 32609, 0, 0, 0, 0, 0, 51488, 11570, 32609, 0, 35632, 10873, 32609, 0, 35616, 10873, 32609, 0, 64589, 28266, 0, 0, 25263, 11497, 32609, 0, 65535, 65535, 0, 0, 5114, 11394, 32609, 0, 8696, 11497, 32609, 0, 50632, 11570, 32609, 0, 18944, 11497, 32609, 0, 50632, 11570, 32609, 0, 1, 0, 0, 0, 24344, 11497, 32609, 0, 35824, 10873, 32609, 0, 32320, 11536, 32609, 0, 0, 0, 0, 0, 36008, 10873, 32609, 0, 34576, 11396, 32609, 0, 4304, 11429, 32609, 0, 0, 0, 0, 0, 32824, 11536, 32609, 0, 0, 0, 0, 0, 36008, 10873, 32609, 0, 43896, 110, 0, 0, 43912, 110, 0, 0, 27, 0, 0, 0, 42279, 11537, 32609, 0, 35741, 11396, 32609, 0, 0, 0, 0, 0, 5, 0, 32609, 0, 8696}}, errorString = {size = 746879440, buf = 0x0}}, classDescr = {className = {size = 1, buf = 0x7f612d3264e0 ""}, classCategory = 0, attrDefinitions = 0x7f612d326838}, objCreate = {ccbId = 1, adminOwnerId = 32609, className = {size = 758277344, buf = 0x0}, parentOrObjectDn = {size = 758278200, buf = 0x0}, attrValues = 0x7f612d32c920, immHandle = 140055301163824}, objModify = {ccbId = 1, adminOwnerId = 32609, objectName = {size = 758277344, buf = 0x0}, attrMods = 0x7f612d326838, immHandle = 0}, objDelete = {ccbId = 1, adminOwnerId = 32609, objectName = {size = 758277344, buf = 0x0}, immHandle = 140055346833464}, obj_sync = {className = {size = 1, buf = 0x7f612d3264e0 ""}, objectName = {size = 0, buf = 0x7f612d326838 "\200\364\062-a\177"}, attrValues = 0x0, next = 0x7f612d32c920}, finSync = {lastContinuationId = 1, adminOwners = 0x7f612d3264e0, implementers = 0x0, classes = 0x7f612d326838, ccbResults = 0x0}, ccbId = 1, searchOp = {client_hdl = 140054588555265, searchId = 758277344}, searchInit = {client_hdl = 140054588555265, rootName = {size = 758277344, buf = 0x0}, scope = 758278200, searchOptions = 0, searchParam = {present = 758303008, choice = {oneAttrParam = {attrName = {size = 712608560, buf = 0x7f612a798b20 "\377\377\377\377"}, attrValueType = 1852505165, attrValue = {val = {saint32 = 753492655, sauint32 = 753492655, saint64 = 140055342047919, sauint64 = 140055342047919, satime = 140055342047919, safloat = 6.63321438e-12, sadouble = 6.9196533022421794e-310, x = {size = 753492655, buf = 0xffffffff <error: Cannot access memory at address 0xffffffff>}}}}}}, attributeNames = 0x7f612c8213fa, ccbId = 753476088}, rtAttUpdRpl = {sr = {client_hdl = 140054588555265, requestNodeId = 758277344, remoteNodeId = 32609, searchId = 0, objectName = {size = 758278200, buf = 0x0}, attributeNames = 0x7f612d32c920}, result = 712608560}, searchRemote = {client_hdl = 140054588555265, requestNodeId = 758277344, remoteNodeId = 32609, searchId = 0, objectName = {size = 758278200, buf = 0x0}, attributeNames = 0x7f612d32c920}, rspSrchRmte = {result = SA_AIS_OK, requestNodeId = 32609, remoteNodeId = 758277344, searchId = 32609, runtimeAttrs = {objectName = {size = 0, buf = 0x7f612d326838 "\200\364\062-a\177"}, attrValuesList = 0x0}}, ctrl = {nodeId = 1, rulingEpoch = 32609, fevsMsgStart = 140055346832608, ndExecPid = 0, canBeCoord = 0 '\000', isCoord = 0 '\000', syncStarted = 0 '\000', nodeEpoch = 758278200, pbeEnabled = 97 'a', dir = {size = 0, buf = 0x7f612d32c920 "\200\364\062-a\177"}, xmlFile = {size = 712608560, buf = 0x7f612a798b20 "\377\377\377\377"}, pbeFile = {size = 1852505165, buf = 0x7f612ce962af "ncs_core_agents_shutdown"}}, adminitGlobal = {globalOwnerId = 1, i = {adminOwnerName = {_opaque = {32609, 0, 25824, 11570, 32609, 0, 0, 0, 0, 0, 26680, 11570, 32609, 0, 0, 0, 0, 0, 51488, 11570, 32609, 0, 35632, 10873, 32609, 0, 35616, 10873, 32609, 0, 64589, 28266, 0, 0, 25263, 11497, 32609, 0, 65535, 65535, 0, 0, 5114, 11394, 32609, 0, 8696, 11497, 32609, 0, 50632, 11570, 32609, 0, 18944, 11497, 32609, 0, 50632, 11570, 32609, 0, 1, 0, 0, 0, 24344, 11497, 32609, 0, 35824, 10873, 32609, 0, 32320, 11536, 32609, 0, 0, 0, 0, 0, 36008, 10873, 32609, 0, 34576, 11396, 32609, 0, 4304, 11429, 32609, 0, 0, 0, 0, 0, 32824, 11536, 32609, 0, 0, 0, 0, 0, 36008, 10873, 32609, 0, 43896, 110, 0, 0, 43912, 110, 0, 0, 27, 0, 0, 0, 42279, 11537, 32609, 0, 35741, 11396, 32609}}, releaseOwnershipOnFinalize = SA_FALSE}}, ccbinitGlobal = {globalCcbId = 1, i = {adminOwnerId = 758277344, ccbFlags = 0, client_hdl = 140055346833464}}, mds_info = {change = NCSMDS_NO_ACTIVE, dest = 140055346832608, svc_id = 0, node_id = 0, role = 758278200}, syncFevsBase = 140054588555265}}, immd = {type = IMMD_EVT_MDS_INFO, info = {ctrl_msg = {ndExecPid = 758302152, epoch = 32609, refresh = 1 '\001', pbeEnabled = 0 '\000', dir = {size = 758277344, buf = 0x0}, xmlFile = {size = 758278200, buf = 0x0}, pbeFile = {size = 758303008, buf = 0x7f612a798b30 "\370!\351,a\177"}, fevs_count = 140055301163808, admo_id_count = 1852505165, ccb_id_count = 0, impl_count = 753492655}, admown_init = {client_hdl = 140055346857416, i = {adminOwnerName = {_opaque = {1, 0, 32609, 0, 25824, 11570, 32609, 0, 0, 0, 0, 0, 26680, 11570, 32609, 0, 0, 0, 0, 0, 51488, 11570, 32609, 0, 35632, 10873, 32609, 0, 35616, 10873, 32609, 0, 64589, 28266, 0, 0, 25263, 11497, 32609, 0, 65535, 65535, 0, 0, 5114, 11394, 32609, 0, 8696, 11497, 32609, 0, 50632, 11570, 32609, 0, 18944, 11497, 32609, 0, 50632, 11570, 32609, 0, 1, 0, 0, 0, 24344, 11497, 32609, 0, 35824, 10873, 32609, 0, 32320, 11536, 32609, 0, 0, 0, 0, 0, 36008, 10873, 32609, 0, 34576, 11396, 32609, 0, 4304, 11429, 32609, 0, 0, 0, 0, 0, 32824, 11536, 32609, 0, 0, 0, 0, 0, 36008, 10873, 32609, 0, 43896, 110, 0, 0, 43912, 110, 0, 0, 27, 0, 0, 0, 42279, 11537, 32609, 0, 35741}}, releaseOwnershipOnFinalize = (SA_TRUE | unknown: 32608)}}, ccb_init = {adminOwnerId = 758302152, ccbFlags = 140054588555265, client_hdl = 140055346832608}, impl_set = {r = {client_hdl = 140055346857416, impl_name = {size = 1, buf = 0x7f612d3264e0 ""}, impl_id = 0, scope = 0, oi_timeout = 758278200}, reply_dest = 0}, objModify = {ccbId = 758302152, adminOwnerId = 32609, objectName = {size = 1, buf = 0x7f612d3264e0 ""}, attrMods = 0x0, immHandle = 140055346833464}, ccbId = 758302152, admoId = 758302152, fevsReq = {sender_count = 140055346857416, reply_dest = 140054588555265, client_hdl = 140055346832608, msg = {size = 0, buf = 0x7f612d326838 "\200\364\062-a\177"}, isObjSync = 0 '\000'}, tmr_info = {type = 758302152, info = {immnd_dest = 140054588555265}}, mds_info = {change = 758302152, dest = 140054588555265, svc_id = 758277344, node_id = 32609, role = 0}, rda_info = {io_role = 758302152}, syncFevsBase = {fevsBase = 140055346857416, client_hdl = 140054588555265}, pbe2 = {epoch = 758302152, maxCcbId = 32609, maxCommitTime = 1, maxWeakCcbId = 140055346832608, maxWeakCommitTime = 0}}}}, sinfo = {to_svc = 749015968, dest = 140055344714965, stype = MDS_SENDTYPE_SNDRSP, ctxt = {length = 0 '\000', data = '\000' <repeats 11 times>, " "}, mSynReqCount = 143 '\217', pid = 32609, uid = 0, gid = 0}}
            out_evt = 0x0
            proc_rc = 1
            cl_node = 0x0
            locked = true
            requested_version = <optimized out>
            timeout_env_value = 0x0
            __FUNCTION__ = "initialize_common"
    #9  0x00007f612c824195 in saImmOiInitialize_2 (immOiHandle=immOiHandle@entry=0x6eab78 <_control_block+632>, immOiCallbacks=immOiCallbacks@entry=0x6e7100 <avd_callbacks>, inout_version=inout_version@entry=0x7f612a798e30) at imma_oi_api.c:81
            cl_node = 0x0
            rc = 4294966780
    #10 0x00000000004a94c6 in immutil_saImmOiInitialize_2 (immOiHandle=0x6eab78 <_control_block+632>, immOiCallbacks=0x6e7100 <avd_callbacks>, version=0x6e8d40 <immVersion>) at immutil.c:1121
            localVer = {releaseCode = 65 'A', majorVersion = 2 '\002', minorVersion = 15 '\017'}
            rc = <optimized out>
            nTries = <optimized out>
    #11 0x00000000004375ef in avd_imm_reinit_bg_thread (_cb=0x6ea900 <_control_block>) at imm.cc:1889
            evt = <optimized out>
            status = <optimized out>
            MAX_NO_RETRIES = 180
            __FUNCTION__ = "avd_imm_reinit_bg_thread"
            rc = <optimized out>
            cb = 0x6ea900 <_control_block>
            time = {tv_sec = 1, tv_nsec = 0}
            no_of_retries = 1
    #12 0x00007f612cc79184 in start_thread (arg=0x7f612a799700) at pthread_create.c:312
            __res = <optimized out>
            pd = 0x7f612a799700
            now = <optimized out>
            unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140055301166848, -8514593682340656909, 0, 0, 140055301167552, 140055301166848, 8580556700032747763, 8580570770333681907}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
            not_first_call = <optimized out>
            pagesize_m1 = <optimized out>
            sp = <optimized out>
            freesize = <optimized out>
            __PRETTY_FUNCTION__ = "start_thread"
    #13 0x00007f612b4db37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
    No locals.
    
    Thread 4 (Thread 0x7f612d31cb00 (LWP 490)):
    #0  0x00007f612b4cdfdd in poll () at ../sysdeps/unix/syscall-template.S:81
    No locals.
    #1  0x00007f612cea78a1 in poll (__timeout=3900, __nfds=1, __fds=0x7f612d31c220) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
    No locals.
    #2  osaf_ppoll (io_fds=io_fds@entry=0x7f612d31c220, i_nfds=i_nfds@entry=1, i_timeout_ts=0x7f612d31c240, i_sigmask=i_sigmask@entry=0x0) at osaf_poll.c:105
            current_time = {tv_sec = 28354416, tv_nsec = 39}
            elapsed_time = {tv_sec = 0, tv_nsec = 0}
            time_left = 3900
            start_time = {tv_sec = 1483636, tv_nsec = 247344122}
            time_left_ts = <optimized out>
            result = 900000
    #3  0x00007f612ceaf2ef in ncs_tmr_wait () at sysf_tmr.c:409
            rc = <optimized out>
            inds_rmvd = <optimized out>
            next_delay = <optimized out>
            tv = <optimized out>
            ts_current = {tv_sec = 1483636, tv_nsec = 247343759}
            ts = {tv_sec = 3, tv_nsec = 900000000}
            set = {fd = 5, events = 1, revents = 0}
    #4  0x00007f612cc79184 in start_thread (arg=0x7f612d31cb00) at pthread_create.c:312
            __res = <optimized out>
            pd = 0x7f612d31cb00
            now = <optimized out>
            unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140055346793216, -8514593682340656909, 1, 1, 140055346793920, 140055346793216, 8580567079223871731, 8580570770333681907}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
            not_first_call = <optimized out>
            pagesize_m1 = <optimized out>
            sp = <optimized out>
            freesize = <optimized out>
            __PRETTY_FUNCTION__ = "start_thread"
    #5  0x00007f612b4db37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
    No locals.
    
    Thread 3 (Thread 0x7f612d2fcb00 (LWP 491)):
    #0  0x00007f612b4cdfdd in poll () at ../sysdeps/unix/syscall-template.S:81
    No locals.
    #1  0x00007f612cec9890 in poll (__timeout=20000, __nfds=2, __fds=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
    No locals.
    #2  mdtm_process_recv_events_tcp () at mds_dt_trans.c:791
            pollres = <optimized out>
    #3  0x00007f612cc79184 in start_thread (arg=0x7f612d2fcb00) at pthread_create.c:312
            __res = <optimized out>
            pd = 0x7f612d2fcb00
            now = <optimized out>
            unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140055346662144, -8514593682340656909, 1, 1, 140055346662848, 140055346662144, 8580567268202432755, 8580570770333681907}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
            not_first_call = <optimized out>
            pagesize_m1 = <optimized out>
            sp = <optimized out>
            freesize = <optimized out>
            __PRETTY_FUNCTION__ = "start_thread"
    #4  0x00007f612b4db37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
    No locals.
    
    Thread 2 (Thread 0x7f612d2dcb00 (LWP 492)):
    #0  __clock_nanosleep (clock_id=clock_id@entry=1, flags=flags@entry=1, req=req@entry=0x7f612d2dc1e0, rem=0x7f612b4e9974 <__clock_nanosleep+132>, rem@entry=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:49
            oldstate = 0
            r = 1
    #1  0x00007f612cea7b94 in osaf_nanosleep (sleep_duration=0x7f612bfc7ab0 <base::kOneSecond>) at osaf_time.c:39
            wakeup_time = {tv_sec = 1483637, tv_nsec = 232033850}
            retval = <optimized out>
    #2  0x00007f612bfc7450 in Sleep (duration=...) at ../../../../../osaf/libs/core/cplusplus/base/time.h:135
    No locals.
    #3  rda_callback_task (rda_callback_cb=0x1ae60a0) at rda_papi.cc:119
            msg = "10 2", '\000' <repeats 59 times>
            value = 2
            conn_lost = true
            rc = PCSRDA_RC_FATAL_IPC_CONNECTION_LOST
            retry_count = 0
            cmd_type = RDE_RDA_HA_ROLE
            cb_info = {cb_type = PCS_RDA_ROLE_CHG_IND, info = {io_role = PCS_RDA_STANDBY}}
    #4  0x00007f612cc79184 in start_thread (arg=0x7f612d2dcb00) at pthread_create.c:312
            __res = <optimized out>
            pd = 0x7f612d2dcb00
            now = <optimized out>
            unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140055346531072, -8514593682340656909, 1, 1, 140055346531776, 140055346531072, 8580567251022563571, 8580570770333681907}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
            not_first_call = <optimized out>
            pagesize_m1 = <optimized out>
            sp = <optimized out>
            freesize = <optimized out>
            __PRETTY_FUNCTION__ = "start_thread"
    #5  0x00007f612b4db37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
    No locals.
    
    Thread 1 (Thread 0x7f612d31f780 (LWP 488)):
    #0  search (pTree=pTree@entry=0x7f612ca51440 <imma_cb+160>, key=key@entry=0x7ffc9d42b7d8 "\017\002\002") at patricia.c:93
            pNode = 0x0
            pPrevNode = <optimized out>
    #1  0x00007f612ceabf3f in ncs_patricia_tree_get (pTree=0x7f612ca51440 <imma_cb+160>, pKey=0x7ffc9d42b7d8 "\017\002\002") at patricia.c:433
            pNode = <optimized out>
    #2  0x00007f612c82b6f9 in imma_client_node_get (client_tree=<optimized out>, cl_hdl=<optimized out>, cl_node=0x7ffc9d42b7e8) at imma_db.c:55
    No locals.
    #3  0x00007f612c828af1 in saImmOiDispatch (immOiHandle=416611959311, dispatchFlags=SA_DISPATCH_ALL) at imma_oi_api.c:531
            rc = SA_AIS_OK
            cl_node = 0x0
            locked = true
            pend_fin = 0
            pend_dis = 0
            __FUNCTION__ = "saImmOiDispatch"
    #4  0x0000000000407310 in main_loop () at main.cc:726
            pollretval = <optimized out>
            evt = <optimized out>
            polltmo = <optimized out>
            term_fd = 13
            cb = 0x6ea900 <_control_block>
            error = <optimized out>
    #5  main (argc=<optimized out>, argv=<optimized out>) at main.cc:852
    

    I think by somehow we got to put the below change back, it ensures the OI handle invalidated in main thread (so won't get a Dispatch()) before starting another background thread

    @@ -1954,16 +1964,16 @@ void avd_imm_reinit_bg(void)
            int rc = 0;
    
    -       (void) saImmOiFinalize(avd_cb->immOiHandle);
    -
    -       avd_cb->immOiHandle = 0;
    -       avd_cb->is_implementer = false;
    -
            pthread_attr_init(&attr);
    
     
  • Minh Hon Chau

    Minh Hon Chau - 2016-11-28

    Or maybe we don't Dispatch() while avd_imm_status is ON_GOING?

     
  • Praveen

    Praveen - 2016-11-28

    Hi Minh,
    I guess this is happening in a very thin time window. When avd_imm_reinit_bg() spawns thread and comes out (thread has still not started finalization of handle). Now poll gets unblocked in main thread for imm event. So there will be race where thread will be finalizing the handle and Dispatch() also tries to use it.

    I thinnk one solution is what you have suggest above. Other could be:
    diff --git a/osaf/services/saf/amf/amfd/imm.cc b/osaf/services/saf/amf/amfd/imm.cc
    --- a/osaf/services/saf/amf/amfd/imm.cc
    +++ b/osaf/services/saf/amf/amfd/imm.cc
    @@ -1969,7 +1969,7 @@ void avd_imm_reinit_bg(void)
    return;
    }
    avd_cb->avd_imm_status = AVD_IMM_INIT_ONGOING;
    -
    + avd_cb->imm_sel_obj = -1;
    LOG_NO("Re-initializing with IMM");

        osaf_mutex_lock_ordie(&imm_reinit_thread_startup_mutex);
    

    We can use both also. Please share the traces if possbile. Based on that we can decide.

    Thanks,
    Praveen

     
  • Minh Hon Chau

    Minh Hon Chau - 2016-11-28

    Hi Praveen,

    I think your fix can solve the coredump, but I am a bit worried the other places, where immOiHandle is accessed in mainthread upon other events, eg. failover, report_admin_op_error(), immutil_saImmOiXXX. To be safe I guess we could put the litlle change back, since it does call saImmOiFinalize() first

    @@ -1954,16 +1964,16 @@ void avd_imm_reinit_bg(void)
            int rc = 0;
    
    -       (void) saImmOiFinalize(avd_cb->immOiHandle);
    -
    -       avd_cb->immOiHandle = 0;
    -       avd_cb->is_implementer = false;
    -
            pthread_attr_init(&attr);
    

    It was removed because it's executed twice in a row (second in avd_imm_reinit_bg_thread(), I guess. If that's not a big problem then can we leave it, or we can re-organize the while () loop in avd_imm_reinit_bg_thread()

    Thanks,
    Minh

     
  • Minh Hon Chau

    Minh Hon Chau - 2016-11-28

    Ah I will try to run the test and see, will post the trace if I hit it again

     
    • Minh Hon Chau

      Minh Hon Chau - 2016-11-28

      Retry the tests many times, but have not seen it

       
1 2 > >> (Page 1 of 2)

Log in to post a comment.