From: 刘亚磊 <liu...@eb...> - 2015-07-22 01:52:52
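MFS version: 2.0.72 Community Edition
Operating system on master, chunkserver, dataserver, and client: CentOS 5.9 x64

Problem description:
The cluster holds tens of millions of files, and the MFS master stops responding for 1-2 minutes every hour, on the hour. Monitoring of the master's memory, CPU, disk, and network looks normal. We originally ran version 1.6.25 and suspected a bug in the software itself, so we upgraded to 2.0.72 Community Edition, but the problem persists. Here is the error log from the top of the hour:

Jul 22 07:00:00 mfsmaster1 mfsmaster[22443]: fork error (store data in foreground - it will block master for a while): ENOMEM (Cannot allocate memory)
Jul 22 07:01:47 mfsmaster1 mfsmaster[22443]: csdb: found cs using ip:port and csid (192.168.1.82:9422,5), but server is still connected
Jul 22 07:01:47 mfsmaster1 mfsmaster[22443]: can't accept chunkserver (ip: 192.168.1.82 / port: 9422)

刘亚磊 | 买卖宝信息技术有限公司
3rd Floor, Building C, Aocheng Rongfu Center, Hongjunying South Road, Chaoyang District, Beijing (100012)
Direct line: (86) 10 56716100-8995
Email: liu...@eb... | Mobile: (86) 18801039545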
From: Davies L. <dav...@gm...> - 2015-07-23 20:30:27
Every hour, on the hour, the master forks a child process that snapshots the in-memory data to disk. If the data set is small or the disk is fast, this does not affect the master's responsiveness.

Once the data set is large or the disk is busy (and the master is also serving many requests), the snapshot-writing process keeps the disk busy, which blocks the master process when it writes the changelog.

The remedy is a faster disk (SSD) or more memory (so that the newly written snapshot does not have to be flushed to disk immediately).

--
- Davies

_________________________________________
moosefs-users mailing list
moo...@li...
https://lists.sourceforge.net/lists/listinfo/moosefs-users
From: Jakub Kruszona-Z. <jak...@ge...> - 2015-07-24 06:09:15
This is caused by the check for available memory in Linux. Before a "fork", Linux checks whether there is enough memory for "two" copies of the forking process (which is rather stupid, because memory is duplicated in COW mode, so the two processes usually share most of their memory). To "fix" this, you can change this behaviour to "classic" using this command (as root):

    echo "1" > /proc/sys/vm/overcommit_memory

--
Regards,
Jakub Kruszona-Zawadzki
- - - - - - - - - - - - - - - -
Segmentation fault (core dumped)
Phone: +48 602 212 039
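As a rough aid for checking the setting discussed above, here is a minimal Python sketch that reads the current overcommit mode and the kernel's commit accounting. The mode meanings (0, 1, 2) follow the Linux kernel documentation for vm.overcommit_memory; the helper names `describe_overcommit` and `parse_meminfo` are hypothetical and not part of MooseFS.

```python
from pathlib import Path

# Mode meanings as described in the Linux kernel documentation.
MODES = {
    0: "heuristic: obvious overcommits of address space are refused",
    1: "always overcommit: the allocation check never fails",
    2: "strict: total commitments are capped at CommitLimit",
}


def describe_overcommit(raw: str) -> str:
    """Map the contents of /proc/sys/vm/overcommit_memory to a description."""
    mode = int(raw.strip())
    return MODES.get(mode, f"unknown mode {mode}")


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo text into {field name: integer value}."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def main() -> None:
    """Print the current mode and commit figures, if /proc is available."""
    mode_file = Path("/proc/sys/vm/overcommit_memory")
    meminfo_file = Path("/proc/meminfo")
    if not (mode_file.exists() and meminfo_file.exists()):
        print("no Linux /proc filesystem found; nothing to report")
        return
    print(describe_overcommit(mode_file.read_text()))
    mem = parse_meminfo(meminfo_file.read_text())
    for key in ("MemTotal", "CommitLimit", "Committed_AS"):
        print(f"{key}: {mem.get(key, 0)} kB")


if __name__ == "__main__":
    main()
```

Note that a value written with echo lasts only until the next reboot. To make it permanent, the usual approach is to add the line `vm.overcommit_memory = 1` to /etc/sysctl.conf and load it with `sysctl -p`.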
From: Davies L. <dav...@gm...> - 2015-07-24 06:24:51
This makes more sense: fork() failed, so the snapshot was written by the single master process itself, in the foreground.

--
- Davies
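The behaviour described in this thread (snapshot in a forked child, with a blocking foreground write when fork() fails) can be sketched in Python as follows. This is an illustration of the general pattern only, not MooseFS's actual code; `store_snapshot`, its file format, and the injectable `fork` parameter are assumptions made for the example.

```python
import os
from typing import Callable, Optional


def store_snapshot(state: dict, path: str,
                   fork: Callable[[], int] = os.fork) -> Optional[int]:
    """Write `state` to `path`, preferably in a forked child process.

    Returns the child's pid when the write runs in the background, or
    None when fork failed and the write was done in the foreground.
    """
    def dump() -> None:
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            for key in sorted(state):
                f.write(f"{key}={state[key]}\n")
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomically swap in the finished file

    try:
        pid = fork()
    except OSError as err:
        # For example ENOMEM, when the kernel refuses to fork a large process.
        print(f"fork error (storing data in foreground): {err}")
        dump()  # the caller is blocked until the write finishes
        return None

    if pid == 0:
        # Child: works on a copy-on-write view of the parent's memory.
        status = 1
        try:
            dump()
            status = 0
        finally:
            os._exit(status)  # never return into the parent's code path

    # Parent: returns at once and can keep serving requests.
    return pid
```

A caller would reap the child later with `os.waitpid(pid, 0)`. The foreground branch is the one that matches the "store data in foreground - it will block master for a while" log line: the same process that serves requests is busy writing, so it cannot respond until the write is done.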
From: 刘亚磊 <liu...@eb...> - 2015-07-24 06:20:38
Hello,

Following the suggestion, I changed the kernel setting and that problem is solved. But there is now a new problem: on the hour, the master reports this error:

Jul 23 20:01:16 mfsmaster1 mfsmaster[22443]: main master server module: (ip:192.168.1.46) write error: EPIPE (Broken pipe)

刘亚磊

From: Jakub Kruszona-Zawadzki
Sent: 2015-07-24 13:29
To: 刘亚磊
Cc: moosefs-users
Subject: Re: [MooseFS-Users] mfs_master正点失去响应 (mfs_master becomes unresponsive on the hour)

> This is caused by check for available memory in Linux. Linux before "fork" checks if it's enough memory for "two" copies of forking process (which is rather stupid because memory is duplicated in COW mode, so usually both processes shares most of their memory). To "fix" this you can change this behaviour to "classic" using this command (as root):
>
> echo "1" > /proc/sys/vm/overcommit_memory
From: Aleksander W. <ale...@mo...> - 2015-07-24 07:57:58
Hi.

This is not a problem. The message is informational rather than an error.
It means that, while data was being sent through the socket, the connection was closed by the other side. That usually indicates a connection timeout.

Messages of this kind can appear on a heavily loaded network, but they do not cause any data loss.
When the system reconnects, the packet is sent again.

Best regards
Aleksander Wieliczko
Technical Support Engineer
MooseFS.com <moosefs.com>
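To illustrate what the EPIPE message means at the socket level, here is a small Python sketch: a write to a socket whose peer has already closed fails with "Broken pipe", and the sender recovers by reconnecting and sending the same packet again. The function names `send_with_resend` and `demo` are hypothetical, and a local socket pair stands in for the real network connection; this is not how MooseFS itself is implemented.

```python
import socket


def send_with_resend(connect, payload: bytes, max_attempts: int = 3):
    """Send `payload`, reconnecting and resending if the peer hung up.

    `connect` is any callable returning a connected socket.
    Returns (socket, attempts_used).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        sock = connect()
        try:
            sock.sendall(payload)
            return sock, attempt
        except (BrokenPipeError, ConnectionResetError) as err:
            # EPIPE / ECONNRESET: the other side closed the connection.
            last_error = err
            sock.close()
    raise last_error


def demo():
    """First connection has a closed peer; the second one is healthy."""
    peers = []

    def connect():
        ours, theirs = socket.socketpair()
        if not peers:
            theirs.close()  # simulate a peer that timed out and hung up
        peers.append(theirs)
        return ours

    sock, attempts = send_with_resend(connect, b"packet")
    data = peers[-1].recv(16)
    sock.close()
    peers[-1].close()
    return attempts, data
```

In this sketch the first attempt fails on the dead connection and the second one delivers the payload, which mirrors the point made above: the broken pipe is reported, but the packet is sent again after the reconnect, so nothing is lost.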