
#1735 imm: IMMND doesn't handle error when sending finalize-sync message over fevs

Milestone: 4.6.2
Status: invalid
Labels: None
Type: defect
Component: imm
Part: nd
Version: 5.0 FC
Severity: major
Updated: 2016-04-28
Created: 2016-04-07
Private: No

Setup:
Changeset: 7436
Version: OpenSAF 5.0
4 nodes configured with a single PBE

Issue observed:

1) This was observed after the scenario described in ticket #1733.

2) After the reboot, IMMND failed to come up, logging NODE STATE-> IMM_NODE_ISOLATED, and could not be respawned properly.

3) The PL-4 node did not join the cluster despite multiple IMMND restarts, until the cluster was reset.

The following syslog excerpt is from payload PL-4:

Apr 6 10:52:52 OEL_M-SLOT-4 osafimmnd[1646]: Started
Apr 6 10:52:52 OEL_M-SLOT-4 osafimmnd[1646]: NO Persistent Back-End capability configured, Pbe file:imm.db (suffix may get added)
Apr 6 10:52:52 OEL_M-SLOT-4 osafimmnd[1646]: NO IMMD service is UP ... ScAbsenseAllowed?:0 introduced?:0
Apr 6 10:52:52 OEL_M-SLOT-4 osafimmnd[1646]: NO SERVER STATE: IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
Apr 6 10:52:52 OEL_M-SLOT-4 osafimmnd[1646]: NO SETTING COORD TO 0 CLOUD PROTO
Apr 6 10:52:52 OEL_M-SLOT-4 osafimmnd[1646]: NO SERVER STATE: IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
Apr 6 10:52:52 OEL_M-SLOT-4 osafimmnd[1646]: NO SERVER STATE: IMM_SERVER_LOADING_PENDING --> IMM_SERVER_SYNC_PENDING
Apr 6 10:52:52 OEL_M-SLOT-4 osafimmnd[1646]: NO NODE STATE-> IMM_NODE_ISOLATED

Apr 6 11:00:52 OEL_M-SLOT-4 opensafd[1616]: ER Timed-out for response from IMMND
Apr 6 11:00:52 OEL_M-SLOT-4 opensafd[1616]: ER
Apr 6 11:00:52 OEL_M-SLOT-4 opensafd[1616]: ER Going for recovery
Apr 6 11:00:52 OEL_M-SLOT-4 opensafd[1616]: ER Trying To RESPAWN /usr/lib64/opensaf/clc-cli/osaf-immnd attempt #1
Apr 6 11:00:52 OEL_M-SLOT-4 opensafd[1616]: ER Sending SIGABRT to IMMND, pid=1646, (origin parent pid=1636)
Apr 6 11:01:07 OEL_M-SLOT-4 osafimmnd[1808]: Started
Apr 6 11:01:07 OEL_M-SLOT-4 osafimmnd[1808]: NO Persistent Back-End capability configured, Pbe file:imm.db (suffix may get added)
Apr 6 11:01:07 OEL_M-SLOT-4 osafimmnd[1808]: NO IMMD service is UP ... ScAbsenseAllowed?:0 introduced?:0
Apr 6 11:01:07 OEL_M-SLOT-4 osafimmnd[1808]: NO Fevs count adjusted to 51312 preLoadPid: 0
Apr 6 11:01:07 OEL_M-SLOT-4 osafimmnd[1808]: NO SERVER STATE: IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
Apr 6 11:01:07 OEL_M-SLOT-4 osafimmnd[1808]: NO SETTING COORD TO 0 CLOUD PROTO
Apr 6 11:01:07 OEL_M-SLOT-4 osafimmnd[1808]: NO SERVER STATE: IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
Apr 6 11:01:07 OEL_M-SLOT-4 osafimmnd[1808]: NO SERVER STATE: IMM_SERVER_LOADING_PENDING --> IMM_SERVER_SYNC_PENDING
Apr 6 11:01:07 OEL_M-SLOT-4 osafimmnd[1808]: NO NODE STATE-> IMM_NODE_ISOLATED

Apr 6 11:09:07 OEL_M-SLOT-4 opensafd[1616]: ER Timed-out for response from IMMND
Apr 6 11:09:07 OEL_M-SLOT-4 opensafd[1616]: ER Could Not RESPAWN IMMND
Apr 6 11:09:07 OEL_M-SLOT-4 opensafd[1616]: ER
Apr 6 11:09:07 OEL_M-SLOT-4 opensafd[1616]: ER Trying To RESPAWN /usr/lib64/opensaf/clc-cli/osaf-immnd attempt #2
Apr 6 11:09:07 OEL_M-SLOT-4 opensafd[1616]: ER Sending SIGABRT to IMMND, pid=1808, (origin parent pid=1802)
Apr 6 11:09:22 OEL_M-SLOT-4 osafimmnd[1976]: Started
Apr 6 11:09:22 OEL_M-SLOT-4 osafimmnd[1976]: NO Persistent Back-End capability configured, Pbe file:imm.db (suffix may get added)
Apr 6 11:09:22 OEL_M-SLOT-4 osafimmnd[1976]: NO IMMD service is UP ... ScAbsenseAllowed?:0 introduced?:0
Apr 6 11:09:22 OEL_M-SLOT-4 osafimmnd[1976]: NO SERVER STATE: IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
Apr 6 11:09:22 OEL_M-SLOT-4 osafimmnd[1976]: NO SETTING COORD TO 0 CLOUD PROTO
Apr 6 11:09:22 OEL_M-SLOT-4 osafimmnd[1976]: NO SERVER STATE: IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
Apr 6 11:09:22 OEL_M-SLOT-4 osafimmnd[1976]: NO SERVER STATE: IMM_SERVER_LOADING_PENDING --> IMM_SERVER_SYNC_PENDING
Apr 6 11:09:22 OEL_M-SLOT-4 osafimmnd[1976]: NO NODE STATE-> IMM_NODE_ISOLATED

Apr 6 11:10:05 OEL_M-SLOT-4 kernel: imklog 5.8.10, log source = /proc/kmsg started.
Apr 6 11:10:05 OEL_M-SLOT-4 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1217" x-info="http://www.rsyslog.com"] start
Apr 6 11:10:05 OEL_M-SLOT-4 kernel: Initializing cgroup subsys cpuset
Apr 6 11:10:05 OEL_M-SLOT-4 kernel: Initializing cgroup subsys cpu
Apr 6 11:10:05 OEL_M-SLOT-4 kernel: Linux version 2.6.32.67 (root@SLOT-1) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Wed Jul 29 11:31:39 I


When the sync server (coordinator) fails to send the finalize-sync message, it keeps waiting for that message to come back over fevs.

Apr 6 7:55:35.719179 osafimmnd [3728:immnd_proc.c:0840] TR Coord: Sync done, but waiting for confirmed finalizeSync. Out queue:0
Apr 6 7:55:35.819531 osafimmnd [3728:immnd_proc.c:0840] TR Coord: Sync done, but waiting for confirmed finalizeSync. Out queue:0
Apr 6 7:55:35.919867 osafimmnd [3728:immnd_proc.c:0840] TR Coord: Sync done, but waiting for confirmed finalizeSync. Out queue:0
Apr 6 7:55:36.020219 osafimmnd [3728:immnd_proc.c:0840] TR Coord: Sync done, but waiting for confirmed finalizeSync. Out queue:0

So it is stuck in the IMM_SERVER_SYNC_SERVER state and cannot handle new sync requests from newly joined nodes. That is why PL-4 could not join the cluster: the coordinator can only announce a sync when it is in the IMM_SERVER_READY state.
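
For illustration, here is a minimal, self-contained sketch (not the actual OpenSAF code) of the kind of check that could break this deadlock: if the finalize-sync message was never successfully sent over fevs, the coordinator gives up on waiting for its confirmation and returns to the ready state. All types, fields and helper names below are illustrative placeholders; only the state names mirror the ones seen in the logs.

/* Hedged sketch: stop waiting for a finalize-sync confirmation that can
 * never arrive because the message was dropped at send time. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { IMM_SERVER_READY, IMM_SERVER_SYNC_SERVER } server_state_t;

typedef struct {
    server_state_t state;
    bool sync_done;               /* sync finished locally                 */
    bool finalize_sync_confirmed; /* finalize-sync seen back over fevs     */
    bool finalize_sync_sent_ok;   /* the fevs/MDS send itself succeeded    */
} coord_t;

static void abort_sync(coord_t *cb)
{
    (void)cb; /* release sync clients, bump epoch, etc. (out of scope here) */
}

static void coord_periodic_check(coord_t *cb)
{
    if (cb->state != IMM_SERVER_SYNC_SERVER)
        return;

    if (cb->sync_done && !cb->finalize_sync_confirmed &&
        !cb->finalize_sync_sent_ok) {
        /* The message was dropped because the transport send failed, so it
         * will never come back over fevs; give up instead of waiting. */
        fprintf(stderr, "finalize-sync was never sent - aborting sync\n");
        abort_sync(cb);
        cb->state = IMM_SERVER_READY;
    }
}

int main(void)
{
    coord_t cb = { IMM_SERVER_SYNC_SERVER, true, false, false };
    coord_periodic_check(&cb); /* leaves the otherwise-stuck state */
    return cb.state == IMM_SERVER_READY ? 0 : 1;
}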

2 Attachments

Related

Wiki: ChangeLog-4.6.2

Discussion

  • Hung Nguyen

    Hung Nguyen - 2016-04-12

    It seems that IMMND does nothing when it fails to send the finalize-sync message over fevs (a sketch of possible handling follows the log excerpt):

    Apr 6 7:55:35.633000 osafimmnd [3728:immnd_evt.c:3113] T2 SENDING FEVS TO IMMD
    Apr 6 7:55:35.633007 osafimmnd [3728:immsv_evt.c:5454] T8 Sending: IMMD_EVT_ND2D_FEVS_REQ to 0
    Apr 6 7:55:35.646853 osafimmnd [3728:immnd_mds.c:0740] WA MDS Send Failed to service:IMMD rc:2
    Apr 6 7:55:35.646945 osafimmnd [3728:immnd_evt.c:3119] ER Problem in sending asyncronous FEVS message over MDS - dropping message!
    Apr 6 7:55:35.646961 osafimmnd [3728:immnd_evt.c:5764] T2 Failed send fevs message
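
    For illustration only, here is a standalone, hedged sketch of what handling the failure could look like: instead of only logging "dropping message!", the send result is checked, retried a few times, and the failure is finally reported to the caller so the finalize-sync logic can abort the sync. The helper names, the stubbed send, and the retry constants are assumptions made for the sketch, not the real immnd_mds.c interface.

    /* Standalone sketch (not immnd code): retry a failing asynchronous send a
     * few times and report the failure to the caller instead of silently
     * dropping the message. mds_send_stub() stands in for the real MDS send. */
    #include <stdio.h>
    #include <unistd.h>

    #define RC_SUCCESS 1u
    #define RC_FAILURE 2u

    /* Stub for the transport send; the real call would go through MDS/TIPC. */
    static unsigned mds_send_stub(const char *msg)
    {
        (void)msg;
        return RC_FAILURE; /* simulate the "MDS Send Failed ... rc:2" case */
    }

    static unsigned send_fevs_with_retry(const char *msg)
    {
        const int max_attempts = 3;

        for (int attempt = 1; attempt <= max_attempts; ++attempt) {
            if (mds_send_stub(msg) == RC_SUCCESS)
                return RC_SUCCESS;
            fprintf(stderr, "fevs send failed, attempt %d of %d\n",
                    attempt, max_attempts);
            usleep(100 * 1000); /* a transient transport problem may clear up */
        }
        /* Still failing: let the caller abort the sync (or step down as
         * coordinator) rather than wait forever for a message never sent. */
        return RC_FAILURE;
    }

    int main(void)
    {
        return send_fevs_with_retry("finalizeSync") == RC_SUCCESS ? 0 : 1;
    }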

     
  • Hung Nguyen

    Hung Nguyen - 2016-04-13
    • summary: Payload didnot join properly to the cluster after the reboot of payload. --> imm: IMMND doesn't handle error when sending finalize-sync message over fevs
    • status: unassigned --> accepted
    • assigned_to: Hung Nguyen
    • Part: - --> nd
     
  • Hung Nguyen

    Hung Nguyen - 2016-04-14
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -60,3 +60,16 @@
     Apr  6 11:10:05 OEL_M-SLOT-4 kernel: Initializing cgroup subsys cpuset
     Apr  6 11:10:05 OEL_M-SLOT-4 kernel: Initializing cgroup subsys cpu
     Apr  6 11:10:05 OEL_M-SLOT-4 kernel: Linux version 2.6.32.67 (root@SLOT-1) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Wed Jul 29 11:31:39 I
    +
    +
    +
    +-----
    +
    +When the sync server fails to send sync-finalize message, it keeps waiting for the sync-finalize message to come over fevs.
    +
    +Apr  6  7:55:35.719179 osafimmnd [3728:immnd_proc.c:0840] TR Coord: Sync done, but waiting for confirmed finalizeSync. Out queue:0
    +Apr  6  7:55:35.819531 osafimmnd [3728:immnd_proc.c:0840] TR Coord: Sync done, but waiting for confirmed finalizeSync. Out queue:0
    +Apr  6  7:55:35.919867 osafimmnd [3728:immnd_proc.c:0840] TR Coord: Sync done, but waiting for confirmed finalizeSync. Out queue:0
    +Apr  6  7:55:36.020219 osafimmnd [3728:immnd_proc.c:0840] TR Coord: Sync done, but waiting for confirmed finalizeSync. Out queue:0
    +
    +So it's stuck in IMM_SERVER_SYNC_SERVER state and can't handle new sync request from newly joined nodes. That's why PL-4 couldn't join the cluster. The coordinator can only announce sync when it's in IMM_SERVER_READY state.
    
     
  • Hung Nguyen

    Hung Nguyen - 2016-04-14
    • status: accepted --> review
     
  • Neelakanta Reddy

    SC-1 is active

    1. Page allocation failure on SC-1:

    Apr 6 07:55:33 OEL_M-SLOT-1 osafimmd[6916]: NO Node 2030f request sync sync-pid:3259 epoch:0
    Apr 6 07:55:34 OEL_M-SLOT-1 osafimmnd[3728]: NO Announce sync, epoch:81
    Apr 6 07:55:34 OEL_M-SLOT-1 osafimmnd[3728]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
    Apr 6 07:55:34 OEL_M-SLOT-1 osafimmnd[3728]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
    Apr 6 07:55:35 OEL_M-SLOT-1 kernel: osafimmnd: page allocation failure. order:4, mode:0x20
    Apr 6 07:55:35 OEL_M-SLOT-1 kernel: Pid: 3728, comm: osafimmnd Not tainted 2.6.32.67 #1
    Apr 6 07:55:35 OEL_M-SLOT-1 kernel: Call Trace:
    Apr 6 07:55:35 OEL_M-SLOT-1 kernel: [<ffffffff81106adc>] __alloc_pages_nodemask+0x57c/0x6c0
    Apr 6 07:55:35 OEL_M-SLOT-1 kernel: [<ffffffff8113f3c2>] kmem_getpages+0x62/0x170
    Apr 6 07:55:35 OEL_M-SLOT-1 kernel: [<ffffffff8113ff7a>] fallback_alloc+0x19a/0x240
    Apr 6 07:55:35 OEL_M-SLOT-1 kernel: [<ffffffff8113fa1b>] ? cache_grow+0x2bb/0x2f0
    Apr 6 07:55:35 OEL_M-SLOT-1 kernel: [<ffffffff8113fd25>] ____cache_alloc_node+0x95/0x150
    Apr 6 07:55:35 OEL_M-SLOT-1 kernel: [<ffffffff813e0f8a>] ? __alloc_skb+0x7a/0x180
    Apr 6 07:55:35 OEL_M-SLOT-1 kernel: [<ffffffff81140b5f>] kmem_cache_alloc_node_notrace+0x6f/0x130
    Apr 6 07:55:35 OEL_M-SLOT-1 kernel: [<ffffffff81140d9b>] __kmalloc_node+0x7b/0x100
    Apr 6 07:55:35 OEL_M-SLOT-1 kernel: [<ffffffff813e0f8a>] __alloc_skb+0x7a/0x180
    Apr 6 07:55:35 OEL_M-SLOT-1 kernel: [<ffffffffa04d7bdb>] buf_acquire+0x2b/0x60 [tipc]
    Apr 6 07:55:35 OEL_M-SLOT-1 kernel: [<ffffffffa04de077>] tipc_msg_build+0x97/0x210 [tipc]

    2. Sync failed:

    Apr 6 07:55:35 OEL_M-SLOT-1 osafmsgd[7099]: ER Sending the message to the specified destination with error 6
    Apr 6 07:55:35 OEL_M-SLOT-1 osafmsgd[7099]: ER ERR_FAILED_OPERATION: Couldn't Send ASAPi Name Resolution Response Message
    Apr 6 07:55:35 OEL_M-SLOT-1 osafimmnd[3728]: WA MDS Send Failed to service:IMMD rc:2
    Apr 6 07:55:35 OEL_M-SLOT-1 osafimmnd[3728]: ER Problem in sending asyncronous FEVS message over MDS - dropping message!
    Apr 6 07:55:35 OEL_M-SLOT-1 osafimmloadd: ER immsv_finalize_sync failed!
    Apr 6 07:55:35 OEL_M-SLOT-1 osafimmloadd: ER Sync ending ABNORMALLY

    3. Sending the message through TIPC failed:

    Apr 6 7:42:59.850380 mqsv_a[10430] ERR |MDS_SND_RCV: Timeout or Error occured
    Apr 6 7:42:59.850465 mqsv_a[10430] ERR |MDS_SND_RCV: Timeout occured on sndrsp message
    Apr 6 7:42:59.850500 mqsv_a[10430] ERR |MDS_SND_RCV: Adest=<0x00000000,10>

    There is no corresponding logging in MDS:

    Apr 6 7:43:21.873375 mqsv_a[10430] ERR |MDS_SND_RCV: Timeout or Error occured
    Apr 6 7:43:21.873530 mqsv_a[10430] ERR |MDS_SND_RCV: Timeout occured on sndrsp message
    Apr 6 7:43:21.873588 mqsv_a[10430] ERR |MDS_SND_RCV: Adest=<0x00000000,10>
    Apr 6 10:19:08.276857 immlist[13976] NOTIFY |BEGIN MDS LOGGING| PID=<immlist[13976]> | ARCHW=a|64bit=1
    Apr 6 10:19:08.283659 immlist[13976] NOTIFY |MDTM: install_tipc : svc_id = IMMA_OM(26), vdest=65535
    Apr 6 10:19:08.283815 immlist[13976] NOTIFY |MDTM: svc up event for svc_id = IMMND(25), subscri. by svc_id = IMMA_OM(26) pwe_id=1 Adest = <rem_node[3]:dest_tipc_id_ref[3854262298]>

    There seems to be a problem with system resources, and the system is unable to send messages through the transport.
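
    As a small diagnostic aid (a standalone sketch, not part of OpenSAF), one could dump /proc/buddyinfo when such send failures occur: the kernel trace above failed an order:4 (64 KiB contiguous) allocation while TIPC was building a message, and /proc/buddyinfo lists the free blocks per order and zone, so a row of zeros at order 4 and above would confirm memory fragmentation as the cause.

    /* Diagnostic sketch: print /proc/buddyinfo. Each line shows, per memory
     * zone, the number of free blocks of order 0, 1, 2, ...; the fifth count
     * is order 4, the allocation size that failed in the trace above. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/buddyinfo", "r");
        char line[512];

        if (f == NULL) {
            perror("fopen /proc/buddyinfo");
            return 1;
        }
        while (fgets(line, sizeof line, f) != NULL)
            fputs(line, stdout);
        fclose(f);
        return 0;
    }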

     
  • Neelakanta Reddy

    • Attachments has changed:

    Diff:

    --- old
    +++ new
    @@ -1 +1,2 @@
     immnd.tgz (41.2 MB; application/octet-stream)
    +syslog.tgz (627.8 kB; application/x-compressed-tar)
    
     
  • Neelakanta Reddy

    Attached the syslog of the controller and payload.

     
  • Hung Nguyen

    Hung Nguyen - 2016-04-28
    • status: review --> invalid
     
  • Hung Nguyen

    Hung Nguyen - 2016-04-28

    Closing the ticket since it is a problem with MDS.
    Will reopen the ticket if the problem is observed again.

     
