From: Robert S. <rsa...@ne...> - 2011-08-04 00:42:26
|
We have been spending a lot of time trying to get MooseFS stable and optimized. Something I have noticed is that mfsmaster seems to be a bottleneck in our setup. What I also noticed is that mfsmaster is single threaded. From reading the source code it seems to use a very interesting polling loop to handle all communications and actions.

So a question: is there anything on the roadmap to make mfsmaster multithreaded?

It also seems that the performance of MooseFS is very dependent on the performance of mfsmaster. If the machine running mfsmaster is slow or busy then it can slow everything down significantly or even cause instability in the file system.

This also implies that if you want to buy a dedicated machine for mfsmaster you have to buy the fastest possible CPU and as much RAM as you need. Local disk space and multiple CPUs and cores are not important. Is this correct? What would the recommendation be for an optimal machine to run mfsmaster?

Robert |
From: Michal B. <mic...@ge...> - 2011-08-08 06:53:20
|
Hi Robert,

I wrote briefly about multithreading in mfsmaster here:
http://sourceforge.net/mailarchive/message.php?msg_id=26680860

So no, it is not on our roadmap.

And yes, the performance of MooseFS is dependent on the performance of mfsmaster. CPU load depends on the number of operations in the filesystem. In our environment the master server consumes about 30% of CPU (ca. 1500 operations per second). The HDD doesn't have to be huge, but it should be quick for the metadata dumps and the continuous saving of changelogs.

A rough estimate of how much RAM you need is here:
http://www.moosefs.org/moosefs-faq.html#sort

And to be honest, metalogger machines should be as good as the master itself, because in an emergency a metalogger has to be switched to the role of the master.

Kind regards
-Michal |
From: Robert S. <rsa...@ne...> - 2011-08-08 12:53:15
|
Hi Michal,

The paper is an interesting read. I think that the growth of technology is, however, making it impractical. Both Intel and AMD are working on 20+ core CPUs and there are some non-x86 64-core systems available today. These systems focus on slower cores, but many of them. For systems to be able to scale in the future they need to be able to use effectively the hardware that will be available then. Unfortunately (according to the paper) the paradigm that is winning on both the hardware and software fronts is threads. Do threads have problems? Yes, but they may be slightly less problematic than pointers ;-)

With a 2 GHz Xeon I am seeing scaling problems as we approach 94 million files. I had another crash this weekend and had to increase timeouts yet again. At this stage the master is unresponsive for at least 5 minutes every hour. The graphs in the CGI look like a comb, with 0 activity on the hour, every hour, for about 5 minutes. That is, except for CPU usage on the master, which spikes to 100% for the same period. We did see an increase in performance and stability when we moved some tasks from the master server to other machines, but at this stage we can't move more tasks off the master without buying more hardware. During the time of 0 activity we see read and write timeouts and the filesystem is completely unresponsive to users.

I am convinced that part of the scalability issue is related to the fact that everything is single threaded, and that any single task that can take a long time has the potential to cause problems affecting scalability and stability.

We still have approximately another 16 TB to move to MooseFS, so I do expect us to easily pass the 100 million file mark. As we are deduplicating the files as we move them it is hard to predict how much space and how many files it will be when we are done. We are also adding more than 4 million files and 2 TB per month (before deduplication).

Robert |
From: Elliot F. <efi...@gm...> - 2011-08-08 19:33:43
Attachments:
filesystem.c.patch
|
On Mon, Aug 8, 2011 at 6:52 AM, Robert Sandilands <rsa...@ne...> wrote:
> At this stage the master is unresponsive for at least 5 minutes every hour.
> The graphs in the CGI look like a comb, with 0 activity on the hour, every
> hour, for about 5 minutes.
>
> I am convinced that part of the scalability issue is related to the fact
> that everything is single threaded, and that any single task that can take
> a long time has the potential to cause problems affecting scalability and
> stability.

Robert,

Metadata access is single threaded, but at the top of every hour when the metadata is stored, the mfsmaster process is essentially dual-threaded (or more accurately dual-processed). The process forks (or at least tries to) and the metadata is stored in a background process, allowing the main process to continue to serve requests.

If you only have a single core on your master, then obviously both processes will have to share it, so CPU usage will spike every hour when the metadata is stored, but the master should still continue to serve requests. If the fork doesn't happen for any reason, then mfsmaster will stop serving requests while it stores the metadata, thus pausing all clients regardless of how many cores you have. And finally, if you have multiple cores and the fork works, you *should* be able to store the metadata and continue to serve client requests without a noticeable delay.

Attached is a patch for filesystem.c that will indicate in your log file whether or not the fork was successful. I'd be curious to see the results.

Elliot |
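The fork-and-dump pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual filesystem.c code; fs_storeall() is a hypothetical stand-in for the real metadata writer, and a real implementation also has to reap the child.

    #include <sys/types.h>
    #include <syslog.h>
    #include <unistd.h>

    void fs_storeall(void);   /* hypothetical stand-in for the real metadata dump */

    /* Called once an hour: try to dump metadata in a child so the parent can
     * keep serving requests; fall back to a blocking dump if fork() fails. */
    void hourly_metadata_dump(void) {
        pid_t pid = fork();
        if (pid == 0) {            /* child: write metadata in the background */
            fs_storeall();
            _exit(0);
        } else if (pid > 0) {      /* parent: keep serving client requests */
            syslog(LOG_NOTICE, "metadata dump forked as pid %d", (int)pid);
            /* the real code must also wait() for the child eventually */
        } else {                   /* fork failed: dump in the foreground */
            syslog(LOG_WARNING, "fork failed, storing metadata in foreground");
            fs_storeall();         /* clients stall until this returns */
        }
    }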
From: Elliot F. <efi...@gm...> - 2011-08-08 19:46:11
Attachments:
filesystem.c.patch
|
On Mon, Aug 8, 2011 at 1:33 PM, Elliot Finley <efi...@gm...> wrote:
> Attached is a patch for filesystem.c that will indicate in your log
> file whether or not the fork was successful. I'd be curious to see
> the results.

Sorry, that last patch has a small problem; attached is the correct one.

Elliot |
From: Robert S. <rsa...@ne...> - 2011-08-08 22:37:24
|
Or I can log into the system on the hour and see if two processes named mfsmaster exist. In my case they do not, which may indicate that fork() is failing.

Running strace on the single instance of mfsmaster also indicates it is busy writing to a file, and I can see the following files:

-rw-r----- 1 daemon daemon 11G Aug 8 18:02 metadata.mfs.back
-rw-r----- 1 daemon daemon 11G Aug 8 17:02 metadata.mfs.back.tmp

metadata.mfs.back.tmp was deleted several seconds later.

iostat -x also indicates 100% utilization on the volume where the metadata is stored, with a very high number of writes.

This leaves me with:

1. Get a faster disk for the metadata backups (SSD?)
2. Figure out why fork() is failing

mfsmaster is the only process using more than 5 GB of RAM on the machine (32.6 GB). mfschunkserver uses 4.8 GB. No process seems to be locking any significant amount of memory. The number of processes created per second is < 1. The machine has 64 GB of RAM.

Robert |
From: Robert S. <rsa...@ne...> - 2011-08-08 23:25:10
|
When I run strace on mfsmaster on the hour I get the following:

rename("changelog.1.mfs", "changelog.2.mfs") = 0
rename("changelog.0.mfs", "changelog.1.mfs") = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b571b910b80) = -1 ENOMEM (Cannot allocate memory)
rename("metadata.mfs.back", "metadata.mfs.back.tmp") = 0
open("metadata.mfs.back", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 11

This indicates fork() is failing with an out of memory error. The system has 11 GB cached and 300 MB free. It only has 6 GB of swap. This suggests that clone() (which implements fork() here) tests whether the whole process would fit into memory when cloned, which implies that the memory requirement is actually double what is commonly believed.

I can probably increase swap to make it happy, but that has its own set of issues and is unlikely to solve much, as it will be a similar situation if mfsmaster starts swapping. Although in theory mfsmaster should not start swapping, as a very low percentage of the forked process will actually differ from the original one.

I am testing my theory ;-)

Robert |
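The failure mode described here is consistent with the kernel's commit accounting: fork() momentarily accounts for a second copy of the whole address space as copy-on-write, even though it is almost never written. Whether the kernel refuses it depends on the overcommit settings and on available swap. A rough, Linux-specific way to look at the headroom is to compare CommitLimit and Committed_AS from /proc/meminfo, as in the sketch below (illustrative only; it assumes commit accounting is what matters on this system):

    #include <stdio.h>

    /* Print the kernel's commit accounting headroom. Under strict overcommit
     * (vm.overcommit_memory=2) a fork() is refused once the copy-on-write
     * duplicate would push Committed_AS past CommitLimit, even though the
     * copied pages are never actually written. */
    int main(void) {
        FILE *f = fopen("/proc/meminfo", "r");
        char line[128];
        unsigned long limit = 0, committed = 0;
        if (!f) {
            perror("/proc/meminfo");
            return 1;
        }
        while (fgets(line, sizeof(line), f)) {
            sscanf(line, "CommitLimit: %lu kB", &limit);
            sscanf(line, "Committed_AS: %lu kB", &committed);
        }
        fclose(f);
        printf("CommitLimit %lu kB, Committed_AS %lu kB, headroom %ld kB\n",
               limit, committed, (long)limit - (long)committed);
        return 0;
    }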
From: Elliot F. <efi...@gm...> - 2011-08-09 04:56:20
|
On Mon, Aug 8, 2011 at 5:24 PM, Robert Sandilands <rsa...@ne...> wrote:
> This indicates fork() is failing with an out of memory error. The system
> has 11 GB cached and 300 MB free. It only has 6 GB of swap.

Just out of curiosity, what OS are you using?

Elliot |
From: Robert S. <rsa...@ne...> - 2011-08-09 12:11:51
|
On 8/9/11 12:56 AM, Elliot Finley wrote:
> Just out of curiosity, what OS are you using?

Linux (CentOS 5.6, 64-bit).

Robert |
From: Elliot F. <efi...@gm...> - 2011-08-09 16:39:58
|
On Tue, Aug 9, 2011 at 5:50 AM, Robert Sandilands <rsa...@ne...> wrote:
> Linux (CentOS 5.6, 64-bit).

If/when you get the fork working, please let us (the list) know what it took.

Elliot |
From: Robert S. <rsa...@ne...> - 2011-08-10 00:46:59
|
Increasing the swap space fixed the fork() issue. It seems that you have to ensure that the memory available is always double the memory needed by mfsmaster. None of the swap space was used over the last 24 hours.

This did solve the extreme comb-like behavior of mfsmaster. It still does not resolve its sensitivity to load on the server. I am still seeing timeouts on the chunkservers and mounts on the hour, due to the high CPU and I/O load when the metadata is dumped to disk. It did however decrease significantly.

An example from the logs:

Aug 9 04:03:38 http-lb-1 mfsmount[13288]: master: tcp recv error: ETIMEDOUT (Operation timed out) (1)
Aug 9 04:03:39 http-lb-1 mfsmount[13288]: master: register error (read header: ETIMEDOUT (Operation timed out))
Aug 9 04:03:41 http-lb-1 mfsmount[13288]: registered to master

Robert |
From: Laurent W. <lw...@hy...> - 2011-08-10 11:43:40
|
On Tue, 09 Aug 2011 20:46:45 -0400 Robert Sandilands <rsa...@ne...> wrote:
> I am still seeing timeouts on the chunkservers and mounts on the hour, due
> to the high CPU and I/O load when the metadata is dumped to disk.

Hi,

What if you apply these tweaks to the IP stack on the master/CS/metaloggers?

# to avoid problems with heavily loaded servers
echo 16000 > /proc/sys/fs/file-max
echo 100000 > /proc/sys/net/ipv4/ip_conntrack_max

# to avoid Neighbour table overflow
echo "512" > /proc/sys/net/ipv4/neigh/default/gc_thresh1
echo "2048" > /proc/sys/net/ipv4/neigh/default/gc_thresh2
echo "4048" > /proc/sys/net/ipv4/neigh/default/gc_thresh3

No need to restart anything; these can be applied on the fly without disturbing services.

HTH,
--
Laurent Wandrebeck
HYGEOS, Earth Observation Department / Observation de la Terre
http://www.hygeos.com |
From: Elliot F. <efi...@gm...> - 2011-08-10 15:56:44
|
On Tue, Aug 9, 2011 at 6:46 PM, Robert Sandilands <rsa...@ne...> wrote:
> This did solve the extreme comb-like behavior of mfsmaster. It still does
> not resolve its sensitivity to load on the server. I am still seeing
> timeouts on the chunkservers and mounts on the hour, due to the high CPU
> and I/O load when the metadata is dumped to disk.

Are you using this server as a combination mfsmaster/chunkserver/mfsclient?

If so, is the metadata being written to spindle(s) separate from the ones the chunkserver is using?

How is this box laid out?

Elliot |
From: Robert S. <rsa...@ne...> - 2011-08-11 03:11:57
|
These logs were from a machine that is only running mfsmount and Apache. Load is generally 10+ with I/O wait in the 40-90% range. It has 4 cores and 8 GB of RAM. It is in a DNS round-robin pool with 4 other similar machines. MooseFS is mounted via the following fstab entry:

mfsmount /srv/mfs fuse mfsmaster=mfsmaster,mfsioretries=300,mfsattrcacheto=60,mfsdirentrycacheto=60,mfsentrycacheto=30,_netdev 0 0

Apache has sendfile disabled. The total amount of data transferred through the 5 mfsmounts is slightly more than 1 TB per day. It sounds impressive, but it really is only around 13 MB/s. It is extremely rare for the same file to be downloaded twice in a day, so caching folders and their attributes is potentially useful; caching files is not.

mfsmaster runs on one of the chunkservers. The second chunkserver is a dedicated chunkserver. The third chunkserver also runs mfsmetalogger. The second chunkserver only has 2.5 million of the 96 million chunks, so it is not contributing much yet.

On the master:

The metadata is written to a SATA RAID1 volume. The chunks are stored on a storage array that is connected via SAS. The only activity on the SATA volume is the OS, the metadata and local syslog logging. There is a second SAS array that is used to stage files for deduplication. Part of the deduplication process also moves them to the MooseFS volume. The server is a dual quad-core 2 GHz Xeon and the average load is generally less than 5. The deduplication uses a local mfsmount but is the only user of the mount.

Here are the matching logs from the master:

Aug 10 22:03:30 mfsmaster mfsmaster[xxxxx]: connection with client(ip:xxx.xxx.xxx.65) has been closed by peer
Aug 10 22:03:30 mfsmaster mfsmaster[xxxxx]: connection with client(ip:xxx.xxx.xxx.102) has been closed by peer
Aug 10 22:03:30 mfsmaster mfsmaster[xxxxx]: connection with client(ip:xxx.xxx.xxx.14) has been closed by peer
Aug 10 22:03:30 mfsmaster mfsmaster[xxxxx]: connection with client(ip:xxx.xxx.xxx.14) has been closed by peer
Aug 10 22:03:30 mfsmaster mfsmaster[xxxxx]: connection with client(ip:xxx.xxx.xxx.102) has been closed by peer
Aug 10 22:03:30 mfsmaster mfsmaster[xxxxx]: connection with client(ip:xxx.xxx.xxx.65) has been closed by peer
Aug 10 22:03:39 mfsmaster mfsmaster[xxxxx]: connection with client(ip:xxx.xxx.xxx.14) has been closed by peer
Aug 10 22:03:41 mfsmaster mfsmaster[xxxxx]: connection with client(ip:xxx.xxx.xxx.102) has been closed by peer
Aug 10 22:03:41 mfsmaster mfsmaster[xxxxx]: connection with client(ip:xxx.xxx.xxx.65) has been closed by peer
Aug 10 22:03:41 mfsmaster mfsmaster[xxxxx]: connection with client(ip:xxx.xxx.xxx.14) has been closed by peer
Aug 10 22:03:41 mfsmaster mfsmaster[xxxxx]: connection with client(ip:xxx.xxx.xxx.102) has been closed by peer
Aug 10 22:03:41 mfsmaster mfsmaster[xxxxx]: connection with client(ip:xxx.xxx.xxx.65) has been closed by peer
Aug 10 22:03:41 mfsmaster mfsmaster[xxxxx]: connection with client(ip:xxx.xxx.xxx.14) has been closed by peer
Aug 10 22:03:41 mfsmaster mfsmaster[xxxxx]: connection with client(ip:xxx.xxx.xxx.102) has been closed by peer
Aug 10 22:03:41 mfsmaster mfsmaster[xxxxx]: connection with client(ip:xxx.xxx.xxx.65) has been closed by peer

Robert |
From: Elliot F. <efi...@gm...> - 2011-08-13 03:10:29
|
On Wed, Aug 10, 2011 at 9:11 PM, Robert Sandilands <rsa...@ne...> wrote:
> mfsmaster runs on one of the chunkservers. The second chunkserver is a
> dedicated chunkserver. The third chunkserver also runs mfsmetalogger.

Although it seems this box should be able to handle the load with no problem, the obvious next step in stabilizing your cluster is to move the mfsmaster onto a box dedicated to the mfsmaster process.

It also seems this would be a golden opportunity for the developers to take a look at your box and see why you are getting the client disconnects. If they could figure it out and tweak the code for your box, it would make their own cluster that much more stable.

Elliot |
From: Elliot F. <efi...@gm...> - 2011-08-24 15:06:05
|
On Tue, Aug 9, 2011 at 6:46 PM, Robert Sandilands <rsa...@ne...> wrote:
> This did solve the extreme comb-like behavior of mfsmaster. It still does
> not resolve its sensitivity to load on the server. I am still seeing
> timeouts on the chunkservers and mounts on the hour, due to the high CPU
> and I/O load when the metadata is dumped to disk. It did however decrease
> significantly.

Here is another thought on this...

The process is niced to -19 (very high priority) so that it has good performance. It forks once per hour to write out the metadata. I haven't checked the code for this, but is the forked process lowering its priority so it doesn't compete with the original process?

If it's not, it should be an easy code change to lower the priority in the child process (the metadata writer) so that it doesn't compete with the original process at the same priority.

If you check into this, I'm sure the list would appreciate an update. :)

Elliot |
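The change Elliot suggests would be small. A sketch of what it might look like inside the forked child is below; this is illustrative only, not actual MooseFS code, and fs_storeall() is again a hypothetical stand-in for the metadata writer.

    #include <sys/resource.h>
    #include <sys/time.h>
    #include <syslog.h>
    #include <unistd.h>

    void fs_storeall(void);   /* hypothetical stand-in for the metadata writer */

    /* Runs in the child right after fork(): drop from the inherited -19 nice
     * value to the lowest priority so the dump does not compete with the
     * parent for CPU. (Lowering I/O priority as well would need the
     * Linux-specific ioprio_set() call.) */
    void metadata_dump_child(void) {
        if (setpriority(PRIO_PROCESS, 0, 19) < 0)
            syslog(LOG_WARNING, "setpriority failed, dump keeps parent priority");
        fs_storeall();
        _exit(0);
    }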
From: Robert S. <rsa...@ne...> - 2011-08-26 01:08:50
|
Hi Elliot,

There is nothing in the code to change the priority.

Taking virtually all other load off the chunk and master servers seems to have improved this significantly. I still see timeouts from mfsmount, but not enough to be problematic.

To try and optimize the performance I am experimenting with accessing the data using different APIs and block sizes, but this has been inconclusive. I have tried the effect of posix_fadvise(), sendfile() and different sized buffers for read(). I still want to try mmap(). sendfile() did seem to be slightly slower than read().

Robert |
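For reference, the read-ahead hint mentioned above looks roughly like the sketch below; the 64 kB buffer and the advice flag are exactly the knobs being compared, and the helper itself is illustrative, not code from the web server in question.

    #include <fcntl.h>
    #include <unistd.h>

    /* Stream a file to an already-open descriptor, reading in 64 kB chunks
     * and hinting sequential access so the kernel can schedule read-ahead. */
    static int stream_file(const char *path, int out_fd) {
        char buf[64 * 1024];
        ssize_t n;
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        /* offset 0 and length 0 mean "the whole file" */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        while ((n = read(fd, buf, sizeof(buf))) > 0) {
            if (write(out_fd, buf, (size_t)n) != n) {
                close(fd);
                return -1;
            }
        }
        close(fd);
        return (n < 0) ? -1 : 0;
    }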
From: Davies L. <dav...@gm...> - 2011-08-26 07:25:36
|
Hi Robert,

Another hint to make mfsmaster more responsive is to put metadata.mfs on a disk separate from the change logs, such as a SAS array; you would have to modify the mfsmaster source code to do this.

PS: what is the average size of your files? MooseFS (like GFS) is designed for large files (100 MB+); it does not serve large numbers of small files well. Haystack from Facebook may be the better choice. We (douban.com) use MooseFS to serve 200+ TB (1M files) of offline data and beansdb [1] to serve 500 million small online files, and it performs very well.

[1]: http://code.google.com/p/beansdb/

Davies |
From: Robert S. <rsa...@ne...> - 2011-08-26 12:46:46
|
Hi Davies,

Our average file size is around 560 kB and it grows by approximately 100 kB per year. Our hot set is around 14 million files taking slightly less than 8 TB of space. Around 1 million files are added and removed per week. There is also some growth in the number of hot files, with it doubling every 2 years. In an ideal world I would have a two-level storage arrangement with faster storage for the hot files, but that is not a choice available to me.

I have experimented with storing the files in a database and it has not been a great success. Databases are generally not optimized for storing large blobs, and a lot of databases simply won't store blobs bigger than a certain size.

Beansdb looks like something I have been looking for, but the lack of English documentation is a bit scary. I did look at it through Google Translate and even then the documentation is a bit on the scarce side.

Robert |
From: Davies L. <dav...@gm...> - 2011-08-31 02:42:54
|
The bottleneck is FUSE and mfsmount. You could use a native API of MFS (borrowed from mfsmount) to re-implement an HTTP server, with one socket per thread or a pool of sockets.

I want to do it in Go; maybe Python would be easier.

Davies

On Wed, Aug 31, 2011 at 8:54 AM, Robert Sandilands <rsa...@ne...> wrote:
> Further on this subject.
>
> I wrote a dedicated http server to serve the files instead of using Apache.
> It allowed me to gain a few extra percent of performance and decreased the
> memory usage of the web servers. The web server also gave me some
> interesting timings:
>
> File open average 405.3732 ms
> File read average 238.7784 ms
> File close average 286.8376 ms
> File size average 0.0026 ms
> Net read average 2.536 ms
> Net write average 2.2148 ms
> Log to access log average 0.2526 ms
> Log to error log average 0.2234 ms
>
> Average time to process a file 936.2186 ms
> Total files processed 1,503,610
>
> What I really find scary is that opening a file takes nearly half a second,
> and closing a file a quarter of a second. The time to open() and close() is
> nearly 3 times more than the time to read the data. The server always reads
> in multiples of 64 kB except if less data is available, although it uses
> posix_fadvise() to try and do some read-ahead. This is the average over 5
> machines running mfsmount and my custom web server, running for about 18
> hours.
>
> On a machine that only serves a low number of clients the times for open
> and close are negligible. open() and close() seem to scale very badly with
> an increase in clients using mfsmount.
>
> From looking at the code for mfsmount it seems like all communication to
> the master happens over a single TCP socket with a global handle and mutex
> to protect it. This may be the bottleneck. If there are multiple open()'s
> at the same time they may end up waiting for the mutex to get an
> opportunity to communicate with the master. The same handle and mutex are
> also used to read replies, and this may not help the situation either.
>
> What prevents multiple sockets to the master?
>
> It also seems to indicate that the only way to get the open() average down
> is to introduce more web servers and that a single web server can only
> serve a very low number of clients. Is that a correct assumption?
>
> Robert |
From: Robert S. <rsa...@ne...> - 2011-08-31 03:18:49
|
There is a native API? Where can I find information about it? Or do you have to reverse it from the code?

Robert

On 8/30/11 10:42 PM, Davies Liu wrote:
> The bottleneck is FUSE and mfsmount. You could use a native API of MFS
> (borrowed from mfsmount) to re-implement an HTTP server, with one socket
> per thread or a pool of sockets. |
From: Davies L. <dav...@gm...> - 2011-08-31 06:19:38
|
Not yet, but we could export parts of mfsmount and then create Python or Go bindings for it.

Davies

On Wed, Aug 31, 2011 at 11:18 AM, Robert Sandilands <rsa...@ne...> wrote:
> There is a native API? Where can I find information about it? Or do you
> have to reverse it from the code? |
From: Robert S. <rsa...@ne...> - 2011-09-02 00:57:32
|
I looked at the mfsmount code. It would be a significant effort to provide a usable library/API from it that is as fully functional as mfsmount.

I found a workaround for the open()/close() limitation: I modified my web server to be able to serve files from multiple MFS mounts. I changed each of the 5 web servers to mount the file system on 8 different folders, with 8 instances of mfsmount running, for a total of 40 mounts. The individual web servers then load balance between the different mounts. It seems that if you have more than about 10 simultaneous accesses per mfsmount you run into a significant slowdown with open() and close().

Here are the averages for a slightly shorter time period after I made this change:

File Open average       13.73 ms
File Read average      118.29 ms
File Close average       0.44 ms
File Size average        0.02 ms
Net Read average         2.7  ms
Net Write average        2.36 ms
Log Access average       0.37 ms
Log Error average        0.04 ms

Average time to process a file   137.96 ms
Total files processed            1,391,217

This is a significant improvement and proves, for me at least, that the serialization of open() handling in mfsmount over a single TCP socket causes scaling issues even at low numbers of clients per mount.

Another thing I noticed in the source code is in mfschunkserver. It seems to create 24 threads: 4 helper threads and 2 groups of 10 worker threads. One group handles requests from mfsmaster and is used for replication etc.; the other group handles requests from mfsmount. This basically implies that you can have at most 20 simultaneous accesses to the disks controlled by a single chunk server at any given time. Is there a reason it is that low, and what would be needed to make it tunable or increase the number? Modern disk controllers work well with multiple pending requests and can reorder them to get the most performance out of your disks. SAS and SATA controllers can both do this, but SAS does it a bit better. You generally get the most out of your disk subsystem if you always have a few more pending requests than spindles.

Robert

On 8/31/11 2:19 AM, Davies Liu wrote:
> Not yet, but we can export parts of mfsmount, then create a Python or Go
> binding of it.
>
> Davies
>
> [...]
|
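A minimal sketch of the load-balancing workaround Robert describes, assuming eight mfsmount instances mounted at /mnt/mfs0 through /mnt/mfs7; the paths, the mount count and the simple round-robin policy are assumptions, and a real web server would also track per-mount health rather than rotating blindly.

/* Spread open() calls across several mfsmount instances so no single
 * mount serializes too many concurrent opens. */
#include <fcntl.h>
#include <stdatomic.h>
#include <stdio.h>

#define NUM_MOUNTS 8

/* /mnt/mfs0 .. /mnt/mfs7, each backed by its own mfsmount process */
static const char *mount_roots[NUM_MOUNTS] = {
    "/mnt/mfs0", "/mnt/mfs1", "/mnt/mfs2", "/mnt/mfs3",
    "/mnt/mfs4", "/mnt/mfs5", "/mnt/mfs6", "/mnt/mfs7",
};

static atomic_uint next_mount = 0;   /* round-robin counter shared by all threads */

/* Open a file relative to the MooseFS root, rotating across the mounts. */
int open_balanced(const char *relative_path)
{
    char full_path[4096];
    unsigned idx = atomic_fetch_add(&next_mount, 1) % NUM_MOUNTS;

    snprintf(full_path, sizeof(full_path), "%s/%s",
             mount_roots[idx], relative_path);
    return open(full_path, O_RDONLY);
}

Since every mount sees the same file tree, the same relative path works through any of the roots; the rotation simply keeps the number of simultaneous opens per mfsmount instance low, which is where the slowdown was observed.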
From: Robert S. <rsa...@ne...> - 2011-10-17 13:34:30
|
Hi Michal,

The machine never used the swap. I verified that over several days using sar. It just needed the swap to be able to fork successfully. It seems that Linux will fail to fork if there is not enough space to account for a complete copy of the forking application, even though that memory won't actually be used: the kernel refuses to over-subscribe the memory. I verified this by looking at the fork() code in the kernel. There is a check that the current amount of used memory plus the size of the forking application is less than the memory available; if that is not the case, the fork() call fails.

I agree that using swap should be a last desperate measure and that no production system should depend on swap to operate.

Even on a much faster dedicated master with significantly more RAM we still see timeouts. It just seems to be limited to around 1 minute every hour rather than 5 minutes every hour. The new master has 72 GB of RAM and it currently has 125 million files. This has improved stability and has allowed me to focus on other bottlenecks in mfsmount and mfschunkserver.

Robert

On 10/17/11 9:00 AM, Michał Borychowski wrote:
> Hi!
>
> Again, it is not that easy to state that you need to double the memory
> needed by mfsmaster. Fork doesn't copy the whole memory occupied by the
> process. Memory used by both processes is in "copy on write" state and you
> only need space for the "differences". We estimate that for a master which
> handles lots of operations it would be necessary to have 30-40% extra
> memory on top of what the process normally uses.
>
> And in the long run increasing swap is not good. When the master starts to
> use it too much during saves, the whole system may hang. Probably that's
> why you have these timeouts. To be honest you should increase physical RAM
> and not the swap. (We had 16 GB RAM and it started to be not enough when
> the master needed 13 GB, so we had to add more RAM then.)
>
> Kind regards
> Michał Borychowski
> MooseFS Support Manager
>
> From: Robert Sandilands [mailto:rsa...@ne...]
> Sent: Wednesday, August 10, 2011 3:12 PM
> To: moo...@li...
> Subject: Re: [Moosefs-users] mfsmaster performance and hardware
>
> Hi Laurent,
>
> Due to the use of ktune a lot of values are already tweaked, for example
> file-max. I don't have iptables loaded, as I measured at some stage that
> conntrack was -really- slow with large numbers of connections.
>
> I am not seeing gc_threshold related log messages, but I can't see any
> reason not to tweak that.
>
> Robert
>
> On 8/10/11 2:20 AM, Laurent Wandrebeck wrote:
>> On Tue, 09 Aug 2011 20:46:45 -0400 Robert Sandilands <rsa...@ne...> wrote:
>>> Increasing the swap space fixed the fork() issue. It seems that you have
>>> to ensure that available memory is always double the memory needed by
>>> mfsmaster. None of the swap space was used over the last 24 hours.
>>>
>>> This did solve the extreme comb-like behavior of mfsmaster. It still
>>> does not resolve its sensitivity to load on the server. I am still
>>> seeing timeouts on the chunkservers and mounts on the hour due to the
>>> high CPU and I/O load when the metadata is dumped to disk. It did
>>> however decrease significantly.
>>>
>>> An example from the logs:
>>>
>>> Aug 9 04:03:38 http-lb-1 mfsmount[13288]: master: tcp recv error: ETIMEDOUT (Operation timed out) (1)
>>> Aug 9 04:03:39 http-lb-1 mfsmount[13288]: master: register error (read header: ETIMEDOUT (Operation timed out))
>>> Aug 9 04:03:41 http-lb-1 mfsmount[13288]: registered to master
>>
>> Hi,
>> What if you apply these tweaks to the IP stack on master/CS/metaloggers?
>>
>> # to avoid problems with heavily loaded servers
>> echo 16000 > /proc/sys/fs/file-max
>> echo 100000 > /proc/sys/net/ipv4/ip_conntrack_max
>>
>> # to avoid Neighbour table overflow
>> echo "512" > /proc/sys/net/ipv4/neigh/default/gc_thresh1
>> echo "2048" > /proc/sys/net/ipv4/neigh/default/gc_thresh2
>> echo "4048" > /proc/sys/net/ipv4/neigh/default/gc_thresh3
>>
>> No need to restart anything, these can be applied on the fly without
>> disturbing services.
>> HTH,
|
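To observe the commit accounting Robert describes on a master, a small generic diagnostic along these lines can help; it is only a sketch, not MooseFS code. It prints the kernel's overcommit setting and the commit counters from /proc/meminfo, then attempts a fork() and reports the error if the kernel refuses it.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Print lines from a /proc file, optionally only those starting with a prefix. */
static void print_file(const char *path, const char *prefix)
{
    char line[256];
    FILE *f = fopen(path, "r");
    if (!f)
        return;
    while (fgets(line, sizeof(line), f))
        if (prefix == NULL || strncmp(line, prefix, strlen(prefix)) == 0)
            printf("%s: %s", path, line);
    fclose(f);
}

int main(void)
{
    /* 0 = heuristic overcommit, 1 = always allow, 2 = strict accounting */
    print_file("/proc/sys/vm/overcommit_memory", NULL);
    /* CommitLimit and Committed_AS: how much the kernel will commit
       versus how much is already committed. */
    print_file("/proc/meminfo", "Commit");

    pid_t pid = fork();
    if (pid < 0) {
        /* The failure mode described above: the child would share pages
           copy-on-write, but the kernel still accounts for a copy of the
           parent's address space when deciding whether to allow the fork. */
        fprintf(stderr, "fork failed: %s\n", strerror(errno));
        return 1;
    }
    if (pid == 0)
        _exit(0);          /* child exits immediately */
    waitpid(pid, NULL, 0);
    return 0;
}

Run on the master while mfsmaster is near its peak memory use, this shows how close Committed_AS is to CommitLimit just before the hourly metadata dump forks.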
From: Elliot F. <efi...@gm...> - 2011-10-17 14:20:03
|
2011/10/17 Robert Sandilands <rsa...@ne...>:
> The new master has 72 GB of RAM and it currently has 125 million files.

Just out of curiosity (and to plan my mfsmaster upgrade), how much RAM does the mfsmaster process use for 125 million files?

Thanks,
Elliot
|