Setup:
Changeset- 7436
Version - opensaf 5.0
4 nodes configured with single PBE
Issue observed:
1)It is observed after the scenario mentioned in ticket #1733
2) After reboot, Immnd failed to come up stating that NODE STATE-> IMM_NODE_ISOLATED and could not respawn properly.
3) PL-4 node didnot join the cluster after the multiple starts until cluster reset.
Following is the timestamp of payload PL-4:
Apr 6 10:52:52 OEL_M-SLOT-4 osafimmnd[1646]: Started
Apr 6 10:52:52 OEL_M-SLOT-4 osafimmnd[1646]: NO Persistent Back-End capability configured, Pbe file:imm.db (suffix may get added)
Apr 6 10:52:52 OEL_M-SLOT-4 osafimmnd[1646]: NO IMMD service is UP ... ScAbsenseAllowed?:0 introduced?:0
Apr 6 10:52:52 OEL_M-SLOT-4 osafimmnd[1646]: NO SERVER STATE: IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
Apr 6 10:52:52 OEL_M-SLOT-4 osafimmnd[1646]: NO SETTING COORD TO 0 CLOUD PROTO
Apr 6 10:52:52 OEL_M-SLOT-4 osafimmnd[1646]: NO SERVER STATE: IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
Apr 6 10:52:52 OEL_M-SLOT-4 osafimmnd[1646]: NO SERVER STATE: IMM_SERVER_LOADING_PENDING --> IMM_SERVER_SYNC_PENDING
Apr 6 10:52:52 OEL_M-SLOT-4 osafimmnd[1646]: NO NODE STATE-> IMM_NODE_ISOLATED
Apr 6 11:00:52 OEL_M-SLOT-4 opensafd[1616]: ER Timed-out for response from IMMND
Apr 6 11:00:52 OEL_M-SLOT-4 opensafd[1616]: ER
Apr 6 11:00:52 OEL_M-SLOT-4 opensafd[1616]: ER Going for recovery
Apr 6 11:00:52 OEL_M-SLOT-4 opensafd[1616]: ER Trying To RESPAWN /usr/lib64/opensaf/clc-cli/osaf-immnd attempt #1
Apr 6 11:00:52 OEL_M-SLOT-4 opensafd[1616]: ER Sending SIGABRT to IMMND, pid=1646, (origin parent pid=1636)
Apr 6 11:01:07 OEL_M-SLOT-4 osafimmnd[1808]: Started
Apr 6 11:01:07 OEL_M-SLOT-4 osafimmnd[1808]: NO Persistent Back-End capability configured, Pbe file:imm.db (suffix may get added)
Apr 6 11:01:07 OEL_M-SLOT-4 osafimmnd[1808]: NO IMMD service is UP ... ScAbsenseAllowed?:0 introduced?:0
Apr 6 11:01:07 OEL_M-SLOT-4 osafimmnd[1808]: NO Fevs count adjusted to 51312 preLoadPid: 0
Apr 6 11:01:07 OEL_M-SLOT-4 osafimmnd[1808]: NO SERVER STATE: IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
Apr 6 11:01:07 OEL_M-SLOT-4 osafimmnd[1808]: NO SETTING COORD TO 0 CLOUD PROTO
Apr 6 11:01:07 OEL_M-SLOT-4 osafimmnd[1808]: NO SERVER STATE: IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
Apr 6 11:01:07 OEL_M-SLOT-4 osafimmnd[1808]: NO SERVER STATE: IMM_SERVER_LOADING_PENDING --> IMM_SERVER_SYNC_PENDING
Apr 6 11:01:07 OEL_M-SLOT-4 osafimmnd[1808]: NO NODE STATE-> IMM_NODE_ISOLATED
Apr 6 11:09:07 OEL_M-SLOT-4 opensafd[1616]: ER Timed-out for response from IMMND
Apr 6 11:09:07 OEL_M-SLOT-4 opensafd[1616]: ER Could Not RESPAWN IMMND
Apr 6 11:09:07 OEL_M-SLOT-4 opensafd[1616]: ER
Apr 6 11:09:07 OEL_M-SLOT-4 opensafd[1616]: ER Trying To RESPAWN /usr/lib64/opensaf/clc-cli/osaf-immnd attempt #2
Apr 6 11:09:07 OEL_M-SLOT-4 opensafd[1616]: ER Sending SIGABRT to IMMND, pid=1808, (origin parent pid=1802)
Apr 6 11:09:22 OEL_M-SLOT-4 osafimmnd[1976]: Started
Apr 6 11:09:22 OEL_M-SLOT-4 osafimmnd[1976]: NO Persistent Back-End capability configured, Pbe file:imm.db (suffix may get added)
Apr 6 11:09:22 OEL_M-SLOT-4 osafimmnd[1976]: NO IMMD service is UP ... ScAbsenseAllowed?:0 introduced?:0
Apr 6 11:09:22 OEL_M-SLOT-4 osafimmnd[1976]: NO SERVER STATE: IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
Apr 6 11:09:22 OEL_M-SLOT-4 osafimmnd[1976]: NO SETTING COORD TO 0 CLOUD PROTO
Apr 6 11:09:22 OEL_M-SLOT-4 osafimmnd[1976]: NO SERVER STATE: IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
Apr 6 11:09:22 OEL_M-SLOT-4 osafimmnd[1976]: NO SERVER STATE: IMM_SERVER_LOADING_PENDING --> IMM_SERVER_SYNC_PENDING
Apr 6 11:09:22 OEL_M-SLOT-4 osafimmnd[1976]: NO NODE STATE-> IMM_NODE_ISOLATED
Apr 6 11:10:05 OEL_M-SLOT-4 kernel: imklog 5.8.10, log source = /proc/kmsg started.
Apr 6 11:10:05 OEL_M-SLOT-4 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1217" x-info="http://www.rsyslog.com"] start
Apr 6 11:10:05 OEL_M-SLOT-4 kernel: Initializing cgroup subsys cpuset
Apr 6 11:10:05 OEL_M-SLOT-4 kernel: Initializing cgroup subsys cpu
Apr 6 11:10:05 OEL_M-SLOT-4 kernel: Linux version 2.6.32.67 (root@SLOT-1) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Wed Jul 29 11:31:39 I
When the sync server fails to send sync-finalize message, it keeps waiting for the sync-finalize message to come over fevs.
Apr 6 7:55:35.719179 osafimmnd [3728:immnd_proc.c:0840] TR Coord: Sync done, but waiting for confirmed finalizeSync. Out queue:0
Apr 6 7:55:35.819531 osafimmnd [3728:immnd_proc.c:0840] TR Coord: Sync done, but waiting for confirmed finalizeSync. Out queue:0
Apr 6 7:55:35.919867 osafimmnd [3728:immnd_proc.c:0840] TR Coord: Sync done, but waiting for confirmed finalizeSync. Out queue:0
Apr 6 7:55:36.020219 osafimmnd [3728:immnd_proc.c:0840] TR Coord: Sync done, but waiting for confirmed finalizeSync. Out queue:0
So it's stuck in IMM_SERVER_SYNC_SERVER state and can't handle new sync request from newly joined nodes. That's why PL-4 couldn't join the cluster. The coordinator can only announce sync when it's in IMM_SERVER_READY state.
It seems that IMMND does nothing when it fails to send finalize-sync msg over fevs
Apr 6 7:55:35.633000 osafimmnd [3728:immnd_evt.c:3113] T2 SENDING FEVS TO IMMD
Apr 6 7:55:35.633007 osafimmnd [3728:immsv_evt.c:5454] T8 Sending: IMMD_EVT_ND2D_FEVS_REQ to 0
Apr 6 7:55:35.646853 osafimmnd [3728:immnd_mds.c:0740] WA MDS Send Failed to service:IMMD rc:2
Apr 6 7:55:35.646945 osafimmnd [3728:immnd_evt.c:3119] ER Problem in sending asyncronous FEVS message over MDS - dropping message!
Apr 6 7:55:35.646961 osafimmnd [3728:immnd_evt.c:5764] T2 Failed send fevs message
Diff:
SC-1 is active
Apr 6 07:55:33 OEL_M-SLOT-1 osafimmd[6916]: NO Node 2030f request sync sync-pid:3259 epoch:0
Apr 6 07:55:34 OEL_M-SLOT-1 osafimmnd[3728]: NO Announce sync, epoch:81
Apr 6 07:55:34 OEL_M-SLOT-1 osafimmnd[3728]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
Apr 6 07:55:34 OEL_M-SLOT-1 osafimmnd[3728]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
Apr 6 07:55:35 OEL_M-SLOT-1 kernel: osafimmnd: page allocation failure. order:4, mode:0x20
Apr 6 07:55:35 OEL_M-SLOT-1 kernel: Pid: 3728, comm: osafimmnd Not tainted 2.6.32.67 #1
Apr 6 07:55:35 OEL_M-SLOT-1 kernel: Call Trace:
Apr 6 07:55:35 OEL_M-SLOT-1 kernel: [<ffffffff81106adc>] alloc_pages_nodemask+0x57c/0x6c0
Apr 6 07:55:35 OEL_M-SLOT-1 kernel: [<ffffffff8113f3c2>] kmem_getpages+0x62/0x170
Apr 6 07:55:35 OEL_M-SLOT-1 kernel: [<ffffffff8113ff7a>] fallback_alloc+0x19a/0x240
Apr 6 07:55:35 OEL_M-SLOT-1 kernel: [<ffffffff8113fa1b>] ? cache_grow+0x2bb/0x2f0
Apr 6 07:55:35 OEL_M-SLOT-1 kernel: [<ffffffff8113fd25>] cache_alloc_node+0x95/0x150
Apr 6 07:55:35 OEL_M-SLOT-1 kernel: [<ffffffff813e0f8a>] ? alloc_skb+0x7a/0x180
Apr 6 07:55:35 OEL_M-SLOT-1 kernel: [<ffffffff81140b5f>] kmem_cache_alloc_node_notrace+0x6f/0x130
Apr 6 07:55:35 OEL_M-SLOT-1 kernel: [<ffffffff81140d9b>] kmalloc_node+0x7b/0x100
Apr 6 07:55:35 OEL_M-SLOT-1 kernel: [<ffffffff813e0f8a>] alloc_skb+0x7a/0x180
Apr 6 07:55:35 OEL_M-SLOT-1 kernel: [<ffffffffa04d7bdb>] buf_acquire+0x2b/0x60 [tipc]
Apr 6 07:55:35 OEL_M-SLOT-1 kernel: [<ffffffffa04de077>] tipc_msg_build+0x97/0x210 [tipc]
Apr 6 07:55:35 OEL_M-SLOT-1 osafmsgd[7099]: ER Sending the message to the specified destination with error 6
Apr 6 07:55:35 OEL_M-SLOT-1 osafmsgd[7099]: ER ERR_FAILED_OPERATION: Couldn't Send ASAPi Name Resolution Response Message
Apr 6 07:55:35 OEL_M-SLOT-1 osafimmnd[3728]: WA MDS Send Failed to service:IMMD rc:2
Apr 6 07:55:35 OEL_M-SLOT-1 osafimmnd[3728]: ER Problem in sending asyncronous FEVS message over MDS - dropping message!
Apr 6 07:55:35 OEL_M-SLOT-1 osafimmloadd: ER immsv_finalize_sync failed!
Apr 6 07:55:35 OEL_M-SLOT-1 osafimmloadd: ER Sync ending ABNORMALLY
Apr 6 7:42:59.850380 mqsv_a[10430] ERR |MDS_SND_RCV: Timeout or Error occured
Apr 6 7:42:59.850465 mqsv_a[10430] ERR |MDS_SND_RCV: Timeout occured on sndrsp message
Apr 6 7:42:59.850500 mqsv_a[10430] ERR |MDS_SND_RCV: Adest=<0x00000000,10>
There is no logging in mds
Apr 6 7:43:21.873375 mqsv_a[10430] ERR |MDS_SND_RCV: Timeout or Error occured
Apr 6 7:43:21.873530 mqsv_a[10430] ERR |MDS_SND_RCV: Timeout occured on sndrsp message
Apr 6 7:43:21.873588 mqsv_a[10430] ERR |MDS_SND_RCV: Adest=<0x00000000,10>
Apr 6 10:19:08.276857 immlist[13976] NOTIFY |BEGIN MDS LOGGING| PID=<immlist<span>[13976]> | ARCHW=a|64bit=1
Apr 6 10:19:08.283659 immlist[13976] NOTIFY |MDTM: install_tipc : svc_id = IMMA_OM(26), vdest=65535
Apr 6 10:19:08.283815 immlist[13976] NOTIFY |MDTM: svc up event for svc_id = IMMND(25), subscri. by svc_id = IMMA_OM(26) pwe_id=1 Adest = <rem_node<span>[3]:dest_tipc_id_ref[3854262298]>
There seems to be problem with system resources, and the system is unable to send messages through transport.
Diff:
Attached the syslog of the controller and payload.
Close the ticket since it's a problem with mds.
Will reopen the ticket if the problem is observed again.