From: Klaus S. <kla...@Ph...> - 2009-04-16 08:44:38
Attachments:
smime.p7s
klaus_steinberger.vcf
Hello,

we've got trouble with our OSR cluster (SL 5.3): sometimes a node unexpectedly runs out of memory and goes wild.

For example, I just started a mkinitrd, and during this process the node went unresponsive. Looking at the console I saw many "Out of memory" messages and processes being killed.

Our current configuration:

Three nodes running SL 5.3.

Each node's memory is limited to 2 GByte, because we run Xen on top of it (we had bad experiences with ballooning dom0, so we limited the memory).

Normally the nodes need around 1 GByte:

[root@aule ~]# cat /proc/meminfo
MemTotal:      2097152 kB
MemFree:       1197244 kB
Buffers:          7408 kB
Cached:         335876 kB
SwapCached:          0 kB
Active:         127860 kB
Inactive:       308336 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:      2097152 kB
LowFree:       1197244 kB
SwapTotal:           0 kB
SwapFree:            0 kB
Dirty:             152 kB
Writeback:           0 kB
AnonPages:       92852 kB
Mapped:          22608 kB
Slab:            93820 kB
PageTables:       5256 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:   1048576 kB
Committed_AS:  1747212 kB
VmallocTotal: 34359738367 kB
VmallocUsed:      4828 kB
VmallocChunk: 34359733487 kB
[root@aule ~]#

What would be the best way to add swap to a node? Or any other idea? Or just extend the dom0 memory?

Sincerely,
Klaus Steinberger
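For reference, the conventional way to add swap to a node like this is a swap file; a minimal sketch, assuming a hypothetical /var/swapfile on a local (non-GFS) filesystem with 2 GB free:

    # create and enable a 2 GB swap file (path and size are illustrative)
    dd if=/dev/zero of=/var/swapfile bs=1M count=2048
    chmod 600 /var/swapfile
    mkswap /var/swapfile
    swapon /var/swapfile

    # make it persistent across reboots
    echo "/var/swapfile swap swap defaults 0 0" >> /etc/fstab

The "extend dom0 memory" route would instead mean raising the dom0_mem= parameter on the Xen hypervisor line in grub.conf and rebooting the node.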
From: Marc G. <gr...@at...> - 2009-04-16 09:09:59
On Thursday 16 April 2009 10:44:25 Klaus Steinberger wrote:
> Hello,
>
> we've got trouble with our OSR cluster (SL 5.3): sometimes a node
> unexpectedly runs out of memory and goes wild.
>
> [... configuration and /proc/meminfo snipped ...]
>
> What would be the best way to add swap to a node?
> Or any other idea? Or just extend the dom0 memory?

I've seen this behaviour very rarely (only on clusters using Xen and Red Hat Cluster), but there seems to be a memory leak somewhere. The last time I saw it (and I've only seen it about twice), there were loads of cached objects in /proc/slabinfo. Currently I don't think it's related to using OSR, but to using the cluster or Xen itself.

Next time you see something like this, it would be great if you could provide sysrq-m/t and the output of /proc/slabinfo. Just log in to the node that is out of memory via

    telnet <nodename> 12242

then type "shell" (you're now in a shell in the rescue chroot/fenceacksv). Next, do a cat /proc/slabinfo. Then exit from the shell, and back in the fenceacksv type:

    memory    (for sysrq-m)
    tasks     (for sysrq-t)

As a prerequisite you should redirect syslog to a central syslog server, so that the memory and task dumps are captured. Also be aware that - if sysrq-t/m takes very long to complete - the other nodes might start fencing this one. But as you might have to "restart" this node anyway, it doesn't hurt too much.

Sorry, but currently I don't see any other option, as this happens very rarely and I was only once able to trace it down a little further.

> Sincerely,
> Klaus Steinberger

BTW: as you are using GFS/rgmanager and Xen, you should be aware of these Red Hat bugzillas: 485026, 490449, 487214, 468691

--
Gruss / Regards,
Marc Grimme
http://www.atix.de/
http://www.open-sharedroot.org/
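For the central-syslog prerequisite Marc describes, a minimal sketch for the stock sysklogd shipped with SL 5.x (logserver is a hypothetical hostname):

    # /etc/syslog.conf on each cluster node: forward all messages
    *.*     @logserver

    # pick up the change
    service syslog restart

On the receiving server, syslogd must be started with remote reception enabled, e.g. SYSLOGD_OPTIONS="-r" in /etc/sysconfig/syslog.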
From: Klaus S. <kla...@Ph...> - 2009-04-23 08:32:31
Hello,

> The last time I saw it (and I've only seen it about twice), there were
> loads of cached objects in /proc/slabinfo. Currently I don't think it's
> related to using OSR, but to using the cluster or Xen itself.

Now I've got a node with memory trouble.

> Next time you see something like this, it would be great if you could
> provide sysrq-m/t and the output of /proc/slabinfo.

The sysrq didn't work, as none of the nodes responds on port 12242; probably the fenceackserver is not running (although it is configured in cluster.conf). At least the node is halfway responsive, so I could do a "cat /proc/slabinfo".

Sincerely,
Klaus
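A quick way to confirm whether the fence ack server is actually up on a node - a sketch, assuming the daemon appears under the name fenceacksv that Marc used above:

    # is anything listening on the fenceacksv port?
    netstat -tlnp | grep 12242

    # is the daemon process running?
    ps ax | grep -i fenceacksv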