
#2325 clm: standby clmd crashed after failing to read node configuration from IMM.

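Milestone: 5.0.2
Status: fixed
Owner: Praveen
Labels: None
Type: defect
Component: clm
Part: d
Severity: major
Updated: 2017-03-10
Created: 2017-02-24
Creator: Praveen
Private: No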

The issue is not reproducible.
While coming up as standby, CLMD successfully initializes with IMM and reads the cluster-related configuration. While reading the node-related configuration from IMM, CLMD calls saImmOmSearchNext_2(). This call could not send its message to IMMND and failed:
Feb 15 06:32:17 SC-2-2 osafclmd[3972]: WA OpenSAF imm lib: Message loss detected for dest 565213425675031 service id:25
Feb 15 06:32:17 SC-2-2 osafimmnd[3930]: WA IMMND - Client Node Get Failed for cli_hdl:932008034831
Feb 15 06:32:17 SC-2-2 osafclmd[3972]: WA OpenSAF imm lib: Message loss detected for dest 565213425675031 service id:25
Feb 15 06:32:17 SC-2-2 osafclmd[3972]: WA marking handle as exposed

CLMD does not explicitly check whether the node configuration read was successful. It comes up and completes the cold sync anyway. When a payload later joins the cluster, the active CLMD checkpoints runtime data for that node. Since the node is not present in the standby CLMD's database, the standby crashes:

Feb 15 06:33:26 SC-2-2 osafimmd[3915]: NO SBY: New Epoch for IMMND process at node 2020f old epoch: 22 new epoch:23
Feb 15 06:33:26 SC-2-2 osafclmd[3972]: ER Node is NULL,problem with the database.
Feb 15 06:33:26 SC-2-2 osafclmd[3972]: ../../opensaf/src/clm/clmd/clms_mbcsv.c:468: ckpt_proc_node_rec: Assertion '0' failed.
Feb 15 06:33:27 SC-2-2 osafamfnd[4002]: NO 'safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
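
The failure mode described above can be illustrated with a small, self-contained sketch. It is not CLMD code: the types, the function names (db_add, db_find, apply_node_ckpt) and the node name used in the usage note are all hypothetical. It only shows how a standby whose node table was never populated ends up with a failed lookup when a checkpoint record arrives, and one possible way to report that as an error instead of asserting.

```c
#include <stddef.h>
#include <string.h>

#define MAX_NODES 8
#define NAME_LEN 64

/* Hypothetical, simplified stand-ins for the standby's node database. */
struct node_rec {
    char name[NAME_LEN];
    int member;
};

struct node_db {
    struct node_rec nodes[MAX_NODES];
    size_t count;
};

/* Add a node as if it had been read from configuration; returns 0 on success. */
int db_add(struct node_db *db, const char *name)
{
    if (db->count >= MAX_NODES || strlen(name) >= NAME_LEN)
        return -1;
    strcpy(db->nodes[db->count].name, name);
    db->nodes[db->count].member = 0;
    db->count++;
    return 0;
}

/* Look a node up by name; returns NULL if it was never loaded. */
struct node_rec *db_find(struct node_db *db, const char *name)
{
    size_t i;

    for (i = 0; i < db->count; i++)
        if (strcmp(db->nodes[i].name, name) == 0)
            return &db->nodes[i];
    return NULL;
}

enum ckpt_result { CKPT_APPLIED, CKPT_UNKNOWN_NODE };

/* Apply a checkpointed membership update. An unknown node is reported
 * to the caller rather than treated as a fatal assertion. */
enum ckpt_result apply_node_ckpt(struct node_db *db, const char *name, int member)
{
    struct node_rec *node = db_find(db, name);

    if (node == NULL)
        return CKPT_UNKNOWN_NODE;
    node->member = member;
    return CKPT_APPLIED;
}
```

With an empty database, apply_node_ckpt(&db, "PL-3", 1) returns CKPT_UNKNOWN_NODE; once db_add(&db, "PL-3") has run, the same call returns CKPT_APPLIED. Whether the standby should tolerate such a record at all is a design question this sketch does not answer.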

Related

Tickets: #2325
Wiki: ChangeLog-5.0.2
Wiki: ChangeLog-5.1.1

Discussion

  • Praveen

    Praveen - 2017-02-24
    • Milestone: 5.2.FC --> 5.0.2
     
  • Praveen

    Praveen - 2017-03-03
    • status: accepted --> review
     
  • Praveen

    Praveen - 2017-03-03

    This looks like a case where IMM returns BAD_HANDLE:
    Feb 15 6:32:17.783637 osafclmd [3972:../../opensaf/src/imm/agent/imma_om_api.cc:7429] << search_init_common
    Feb 15 6:32:17.784315 osafclmd [3972:../../opensaf/src/imm/agent/imma_mds.cc:0673] WA OpenSAF imm lib: Message loss detected for dest 565213425675031 service id:25
    Feb 15 6:32:17.784330 osafclmd [3972:../../opensaf/src/imm/agent/imma_db.cc:0631] >> imma_mark_clients_stale
    Feb 15 6:32:17.784338 osafclmd [3972:../../opensaf/src/imm/agent/imma_db.cc:0674] TR Search id 218 for handle d90002020f closed for stale imm-handle
    Feb 15 6:32:17.784353 osafclmd [3972:../../opensaf/src/imm/agent/imma_db.cc:0683] WA marking handle as exposed
    Feb 15 6:32:17.784359 osafclmd [3972:../../opensaf/src/imm/agent/imma_db.cc:0689] TR Stale marked client cl:217 node:2020f
    Feb 15 6:32:17.784366 osafclmd [3972:../../opensaf/src/imm/agent/imma_db.cc:0775] >> isExposed
    Feb 15 6:32:17.784371 osafclmd [3972:../../opensaf/src/imm/agent/imma_db.cc:0836] TR isExposed Returning Exposed:1
    Feb 15 6:32:17.784377 osafclmd [3972:../../opensaf/src/imm/agent/imma_db.cc:0837] << isExposed
    Feb 15 6:32:17.784383 osafclmd [3972:../../opensaf/src/imm/agent/imma_db.cc:0704] << imma_mark_clients_stale
    Feb 15 6:32:17.785329 osafclmd [3972:../../opensaf/src/imm/agent/imma_om_api.cc:7630] T3 ERR_BAD_HANDLE: client is stale and exposed
    Feb 15 6:32:17.785359 osafclmd [3972:../../opensaf/src/imm/agent/imma_om_api.cc:7807] >> saImmOmSearchFinalize
    Feb 15 6:32:17.785367 osafclmd [3972:../../opensaf/src/imm/agent/imma_om_api.cc:7865] T1 IMM Handle d90002020f is stale
    Feb 15 6:32:17.785376 osafclmd [3972:../../opensaf/src/imm/agent/imma_om_api.cc:7965] << saImmOmSearchFinalize
    Feb 15 6:32:17.785384 osafclmd [3972:../../opensaf/src/imm/agent/imma_om_api.cc:0735] >> saImmOmFinalize
    Feb 15 6:32:17.785391 osafclmd [3972:../../opensaf/src/imm/agent/imma_om_api.cc:0769] T1 Handle d90002020f is stale
    Feb 15 6:32:17.785398 osafclmd [3972:../../opensaf/src/imm/agent/imma_om_api.cc:0867] T3 Handle d90002020f is stale
    Feb 15 6:32:17.785405 osafclmd [3972:../../opensaf/src/imm/agent/imma_proc.cc:0147] >> imma_callback_ipc_destroy
    Feb 15 6:32:17.785412 osafclmd [3972:../../opensaf/src/imm/agent/imma_proc.cc:0000] << imma_callback_ipc_destroy
    Feb 15 6:32:17.785417 osafclmd [3972:../../opensaf/src/imm/agent/imma_proc.cc:0206] TR Deleting client node
    Feb 15 6:32:17.785424 osafclmd [3972:../../opensaf/src/imm/agent/imma_init.cc:0326] >> imma_shutdown

     
  • Praveen

    Praveen - 2017-03-10
    • status: review --> fixed
     
  • Praveen

    Praveen - 2017-03-10

    changeset: 8682:50a2033a8a8d
    branch: opensaf-5.0.x
    parent: 8679:7ec6c15c249f
    user: Praveen Malviya praveen.malviya@oracle.com
    date: Fri Mar 10 10:48:17 2017 +0530
    summary: clmd: try to re-read node config from IMM if BAD_HANDLE is returned [#2325].

    changeset: 8683:59e265654232
    branch: opensaf-5.1.x
    parent: 8680:e02390320bbb
    user: Praveen Malviya praveen.malviya@oracle.com
    date: Fri Mar 10 10:49:06 2017 +0530
    summary: clmd: try to re-read node config from IMM if BAD_HANDLE is returned [#2325].

    changeset: 8684:9338ad3cacc0
    tag: tip
    parent: 8681:0e9c5da42416
    user: Praveen Malviya praveen.malviya@oracle.com
    date: Fri Mar 10 10:49:44 2017 +0530
    summary: clmd: try to re-read node config from IMM if BAD_HANDLE is returned [#2325].
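
The changeset summaries describe re-reading the node configuration when BAD_HANDLE is returned. The actual patch is not shown in this ticket, so the following is only a minimal sketch of that general idea, with hypothetical names throughout (sketch_rc_t, read_nodes_with_retry, fake_read_nodes). It assumes each attempt is wrapped in a callback that obtains a fresh handle and performs the complete read, and it uses a fake reader in place of the IMM API.

```c
#include <stdio.h>

/* Hypothetical result codes for this sketch; real code would use SaAisErrorT. */
typedef enum { SKETCH_OK, SKETCH_BAD_HANDLE, SKETCH_FAILED } sketch_rc_t;

/* One full attempt: (re)initialize the handle, then read all node objects.
 * Hypothetical signature, not an OpenSAF API. */
typedef sketch_rc_t (*read_nodes_fn)(void *ctx);

/* Repeat the whole read while it fails with BAD_HANDLE, up to max_attempts. */
sketch_rc_t read_nodes_with_retry(read_nodes_fn read_nodes, void *ctx,
                                  int max_attempts)
{
    sketch_rc_t rc = SKETCH_BAD_HANDLE;
    int attempt;

    for (attempt = 1; attempt <= max_attempts; attempt++) {
        rc = read_nodes(ctx);
        if (rc != SKETCH_BAD_HANDLE)
            break; /* success, or an error that a retry would not cure */
        fprintf(stderr, "node config read: BAD_HANDLE on attempt %d of %d\n",
                attempt, max_attempts);
    }
    return rc;
}

/* Fake reader for demonstration: returns BAD_HANDLE a fixed number of times. */
struct fake_imm {
    int failures_left;
    int calls;
};

sketch_rc_t fake_read_nodes(void *ctx)
{
    struct fake_imm *imm = ctx;

    imm->calls++;
    if (imm->failures_left > 0) {
        imm->failures_left--;
        return SKETCH_BAD_HANDLE;
    }
    return SKETCH_OK;
}
```

The caller still has to check the final return value: if every attempt ends in BAD_HANDLE, the function returns SKETCH_BAD_HANDLE and the standby should not proceed as if the node configuration had been loaded.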

     


