[OSR-users] Problem with rgmanager

Hi,

I am seeing something strange. I created a two-node cluster, clu01 and clu02,
with the shared root on a SAN. Node clu01 has the IP address 10.43.100.203:

<clusternode name="clu01" votes="1" nodeid="1">
   <com_info>
     <syslog name="clu01"/>
     <rootvolume name="/dev/sda2" fstype="ocfs2"/>
     <eth name="eth0" ip="10.43.100.203" mac="00:15:60:56:75:FD"/>
     <fenceackserver user="root" passwd="test123"/>
   </com_info>
</clusternode>
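
To double-check which address each node is actually configured with (for example in the cluster.conf copy that ends up inside the initrd), a fragment like the one above can be parsed with a few lines of Python. This is only a minimal sketch: `node_addresses` is a hypothetical helper name, not part of any cluster tool, and it assumes the `<clusternode>`/`<eth>` layout shown above.

```python
import xml.etree.ElementTree as ET

# Fragment in the same layout as the cluster.conf excerpt above.
FRAGMENT = """
<clusternode name="clu01" votes="1" nodeid="1">
   <com_info>
     <eth name="eth0" ip="10.43.100.203" mac="00:15:60:56:75:FD"/>
   </com_info>
</clusternode>
"""

def node_addresses(xml_text):
    """Map each clusternode name to its (interface, ip) pairs.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    result = {}
    # Element.iter() also yields the root itself, so this works for a
    # single <clusternode> fragment as well as for a full cluster.conf.
    for node in root.iter("clusternode"):
        result[node.get("name")] = [
            (eth.get("name"), eth.get("ip")) for eth in node.iter("eth")
        ]
    return result

print(node_addresses(FRAGMENT))
```

Running the same function against the file on the shared root and against the copy unpacked from the initrd would show whether the two still disagree about the node address.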

I also configured an httpd service on the cluster, and everything worked well.
I then had to change the IP address of node 1 (clu01) to 10.43.105.10, so I
chose to repeat the whole procedure, reformatting the shared root but not the
server clu01.
The cluster starts with the new IP address, and when I start rgmanager:

/etc/init.d/rgmanager start

everything seems OK, but in the log file I see:

Jun 27 10:13:14 clu01 kernel: dlm: Using TCP for communications
Jun 27 10:13:14 clu01 kernel: dlm: Can't create listening comms socket
Jun 27 10:13:14 clu01 kernel: dlm: cannot start dlm lowcomms -98

and the output of the command is:

clustat
Cluster Status for cluOCFS2 @ Wed Jul  1 13:35:10 2009
Member Status: Quorate

  Member Name                          ID   Status
  ------ ----                          ---- ------
  clu01                                   1 Online, Local
  clu02                                   2 Offline

The service section is missing from the output.
If I try to restart rgmanager, the log shows:

Jun 28 04:02:08 clu01 syslogd 1.4.1: restart.
Jul  1 13:37:31 clu01 kernel: dlm: Using TCP for communications
Jul  1 13:37:31 clu01 kernel: dlm: Can't create listening comms socket
Jul  1 13:37:41 clu01 kernel: BUG: soft lockup - CPU#0 stuck for 10s! [clurgmgrd:13230]
Jul  1 13:37:41 clu01 kernel:
Jul  1 13:37:41 clu01 kernel: Pid: 13230, comm:            clurgmgrd
Jul  1 13:37:41 clu01 kernel: EIP: 0060:[<c0608d90>] CPU: 0
Jul  1 13:37:41 clu01 kernel: EIP is at _spin_lock+0x7/0xf
Jul  1 13:37:41 clu01 kernel:  EFLAGS: 00000286    Tainted: G      (2.6.18-92.1.22.el5PAE #1)
Jul  1 13:37:41 clu01 kernel: EAX: f1d93a98 EBX: f1d93a94 ECX: 00000000 EDX: e1958000
Jul  1 13:37:41 clu01 kernel: ESI: f1d93a94 EDI: f1e31000 EBP: e1958ebc DS: 007b ES: 007b
Jul  1 13:37:41 clu01 kernel: CR0: 8005003b CR2: b7f48000 CR3: 37caef00 CR4: 000006f0
Jul  1 13:37:41 clu01 kernel:  [<c06080ef>] __mutex_lock_slowpath+0x19/0x7c
Jul  1 13:37:41 clu01 kernel:  [<c0608161>] .text.lock.mutex+0xf/0x14
Jul  1 13:37:41 clu01 kernel:  [<f8c2ff6b>] close_connection+0x11/0x5a [dlm]
Jul  1 13:37:41 clu01 kernel:  [<f8c308fd>] dlm_lowcomms_start+0x53e/0x59c [dlm]
Jul  1 13:37:41 clu01 kernel:  [<c06076a4>] schedule+0x920/0x9cd
Jul  1 13:37:41 clu01 kernel:  [<f8c2e879>] dlm_new_lockspace+0x87/0x742 [dlm]
Jul  1 13:37:41 clu01 kernel:  [<f8c33d38>] device_write+0x310/0x4b6 [dlm]
Jul  1 13:37:41 clu01 kernel:  [<f8c33a28>] device_write+0x0/0x4b6 [dlm]
Jul  1 13:37:41 clu01 kernel:  [<c0470283>] vfs_write+0xa1/0x143
Jul  1 13:37:41 clu01 kernel:  [<c0470875>] sys_write+0x3c/0x63
Jul  1 13:37:41 clu01 kernel:  [<c0404eff>] syscall_call+0x7/0xb
Jul  1 13:37:41 clu01 kernel:  =======================
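
For reference, the -98 in the "cannot start dlm lowcomms" line of the first log excerpt looks like a negated errno, as is the usual kernel convention. That is an assumption on my part, not something the log states. On Linux, errno 98 is EADDRINUSE. A minimal sketch for decoding it follows; the small table is hard-coded with Linux values so it gives the same answer on any platform, and `decode_kernel_status` is just an illustrative name.

```python
# Linux errno values, hard-coded for illustration so the result does not
# depend on the platform this snippet runs on.
LINUX_ERRNO = {
    97: ("EAFNOSUPPORT", "Address family not supported by protocol"),
    98: ("EADDRINUSE", "Address already in use"),
    99: ("EADDRNOTAVAIL", "Cannot assign requested address"),
}

def decode_kernel_status(code):
    """Translate a (possibly negated) errno from a kernel log line.

    Assumes the number is a negated errno, which is an assumption
    about the log message, not something it documents.
    """
    name, text = LINUX_ERRNO.get(abs(code), ("UNKNOWN", "not in this small table"))
    return f"{name}: {text}"

print(decode_kernel_status(-98))  # EADDRINUSE: Address already in use
```

If that reading is right, the DLM could not bind its listening socket because the address/port pair it tried was already taken, which would fit with the "Can't create listening comms socket" line.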

If I edit cluster.conf to put back the old IP (10.43.100.203) and create a
new initrd, rgmanager works well.
This happens even with another address in the same 10.43.100 subnet. In
practice, it seems to work only with the single IP address with which the
cluster was originally created!

Thanks.

Ing. Stefano Elmopi
Gruppo Darco - Area ICT Sistemi
Via Ostiense 131/L Corpo B, 00154 Roma

cell. 3466147165
tel.  0657060500
email:ste...@so...