From: Robert S. <rsa...@ne...> - 2011-09-02 00:57:32
I looked at the mfsmount code. It would be a significant effort to provide a usable library/API from it that is as fully functional as mfsmount.

I found a workaround for the open()/close() limitation: I modified my web server to be able to serve files from multiple MFS mounts. I changed each of the 5 web servers to mount the file system on 8 different folders, with 8 instances of mfsmount running, for a total of 40 mounts. The individual web servers then load-balance between the different mounts. It seems that if you have more than about 10 simultaneous accesses per mfsmount you run into a significant slowdown with open() and close().

Here are the averages for a slightly shorter time period after I made this change:

File open average 13.73 ms
File read average 118.29 ms
File close average 0.44 ms
File size average 0.02 ms
Net read average 2.7 ms
Net write average 2.36 ms
Log access average 0.37 ms
Log error average 0.04 ms

Average time to process a file 137.96 ms
Total files processed 1,391,217

This is a significant improvement and proves, for me at least, that the handling of open() in mfsmount, serialized over a single TCP socket, causes scaling issues even at low numbers of clients per mount.

Another thing I noticed in the source code is in mfschunkserver. It seems to create 24 threads: 4 helper threads and 2 groups of 10 worker threads. One group handles requests from mfsmaster and is used for replication etc.; the other group handles requests from mfsmount. This implies that you can have at most 20 simultaneous accesses to the disks controlled by a single chunk server at any given time. Is there a reason it is that low, and what would be needed to make it tunable or to increase the number? Modern disk controllers work well with multiple pending requests and can reorder them to get the most performance out of your disks. SAS and SATA controllers can both do this, SAS a bit better. You generally seem to get the most out of your disk subsystem if you always have a few more pending requests than spindles.

Robert
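PS: To make the load balancing concrete, here is a minimal sketch of the idea. The mount paths and the round-robin counter are illustrative, not the actual server code:

/* Sketch: the web server picks one of N mount points round-robin for
 * every request. Each mount point is served by its own mfsmount
 * process, so open() calls spread over 8 master connections instead
 * of queuing behind a single one. */
#include <stdio.h>
#include <stdatomic.h>

#define NUM_MOUNTS 8

static const char *mounts[NUM_MOUNTS] = {
    "/mnt/mfs0", "/mnt/mfs1", "/mnt/mfs2", "/mnt/mfs3",
    "/mnt/mfs4", "/mnt/mfs5", "/mnt/mfs6", "/mnt/mfs7"
};

static atomic_uint next_mount;

/* Build the on-disk path for a requested file on the mount chosen for
 * this request. */
static void pick_path(const char *rel, char *out, size_t outlen)
{
    unsigned i = atomic_fetch_add(&next_mount, 1) % NUM_MOUNTS;
    snprintf(out, outlen, "%s/%s", mounts[i], rel);
}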
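The open() serialization I am describing looks schematically like the following. This is only a sketch of the pattern; the names are mine, not the mfsmount source, and partial reads/writes are ignored:

/* Every metadata request holds one global lock around a write+read on
 * a single socket to the master, so concurrent open() calls from
 * different threads queue up behind each other. */
#include <pthread.h>
#include <unistd.h>

static int master_fd;                        /* single TCP socket to mfsmaster */
static pthread_mutex_t master_lock = PTHREAD_MUTEX_INITIALIZER;

static ssize_t master_request(const void *req, size_t reqlen,
                              void *reply, size_t replylen)
{
    ssize_t got;

    pthread_mutex_lock(&master_lock);        /* all callers serialize here   */
    (void)write(master_fd, req, reqlen);     /* send the request             */
    got = read(master_fd, reply, replylen);  /* wait for the matching reply  */
    pthread_mutex_unlock(&master_lock);
    return got;
}

With a small pool of sockets (say 4 to 8), each request could grab an idle connection instead, and open() latency should no longer grow with the number of concurrent clients on one mount.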
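And for the chunkserver question, this is roughly what I mean by making the worker count tunable. Again only a sketch under my own names (the CHUNK_WORKERS variable is hypothetical), not the actual mfschunkserver code:

/* A worker pool whose size comes from an environment variable instead
 * of a compile-time constant, so more requests can be kept in flight
 * per chunk server than the current 10 per group. */
#include <pthread.h>
#include <stdlib.h>

#define DEFAULT_WORKERS 10

static void *worker_main(void *arg)
{
    (void)arg;
    /* ... pull chunk I/O jobs from a queue and service them ... */
    return NULL;
}

static int start_workers(pthread_t **out)
{
    long n = DEFAULT_WORKERS;
    const char *env = getenv("CHUNK_WORKERS");  /* hypothetical knob */

    if (env != NULL && atol(env) > 0)
        n = atol(env);

    pthread_t *tids = calloc((size_t)n, sizeof(*tids));
    if (tids == NULL)
        return -1;
    for (long i = 0; i < n; i++)
        pthread_create(&tids[i], NULL, worker_main, NULL);
    *out = tids;
    return (int)n;
}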
On 8/31/11 2:19 AM, Davies Liu wrote:
> Not yet, but we can export parts of mfsmount, then create Python or Go
> bindings of it.
>
> Davies
>
> On Wed, Aug 31, 2011 at 11:18 AM, Robert Sandilands <rsa...@ne...> wrote:
>
> There is a native API? Where can I find information about it? Or do you
> have to reverse it from the code?
>
> Robert
>
> On 8/30/11 10:42 PM, Davies Liu wrote:
>> The bottleneck is FUSE and mfsmount; you should try to use the native
>> API of MFS (borrowed from mfsmount) to re-implement an HTTP server,
>> with one socket per thread or a socket pool.
>>
>> I just want to do it in Go; maybe Python is easier.
>>
>> Davies
>>
>> On Wed, Aug 31, 2011 at 8:54 AM, Robert Sandilands <rsa...@ne...> wrote:
>>
>> Further on this subject.
>>
>> I wrote a dedicated HTTP server to serve the files instead of using
>> Apache. It allowed me to gain a few extra percent of performance and
>> decreased the memory usage of the web servers.
>>
>> The web server also gave me some interesting timings:
>>
>> File open average 405.3732 ms
>> File read average 238.7784 ms
>> File close average 286.8376 ms
>> File size average 0.0026 ms
>> Net read average 2.536 ms
>> Net write average 2.2148 ms
>> Log to access log average 0.2526 ms
>> Log to error log average 0.2234 ms
>>
>> Average time to process a file 936.2186 ms
>> Total files processed 1,503,610
>>
>> What I really find scary is that opening a file takes nearly half a
>> second and closing a file a quarter of a second. The time to open()
>> and close() is nearly 3 times the time to read the data. The server
>> always reads in multiples of 64 kB unless less data is available,
>> although it uses posix_fadvise() to try to do some read-ahead. This is
>> the average over 5 machines running mfsmount and my custom web server
>> for about 18 hours.
>>
>> On a machine that only serves a low number of clients, the times for
>> open and close are negligible. open() and close() seem to scale very
>> badly with an increase in clients using mfsmount.
>>
>> From looking at the code for mfsmount it seems like all communication
>> to the master happens over a single TCP socket with a global handle
>> and a mutex to protect it. This may be the bottleneck: if there are
>> multiple open()s at the same time, they may end up waiting for the
>> mutex to get an opportunity to communicate with the master. The same
>> handle and mutex are also used to read replies, which may not help the
>> situation either.
>>
>> What prevents multiple sockets to the master?
>>
>> It also seems to indicate that the only way to get the open() average
>> down is to introduce more web servers, and that a single web server
>> can only serve a very low number of clients. Is that a correct
>> assumption?
>>
>> Robert
>>
>> On 8/26/11 3:25 AM, Davies Liu wrote:
>>> Hi Robert,
>>>
>>> Another hint to make mfsmaster more responsive is to locate the
>>> metadata.mfs on a separate disk with the change logs, such as a SAS
>>> array; you would have to modify the source code of mfsmaster to do
>>> this.
>>>
>>> PS: what is the average size of your files? MooseFS (like GFS) is
>>> designed for large files (100M+); it cannot serve large numbers of
>>> small files well. Haystack from Facebook may be the better choice.
>>> We (douban.com) use MooseFS to serve 200+T (1M files) of offline data
>>> and beansdb [1] to serve 500 million online small files, and it
>>> performs very well.
>>>
>>> [1]: http://code.google.com/p/beansdb/
>>>
>>> Davies
>>>
>>> On Fri, Aug 26, 2011 at 9:08 AM, Robert Sandilands <rsa...@ne...> wrote:
>>>
>>> Hi Elliot,
>>>
>>> There is nothing in the code to change the priority.
>>>
>>> Taking virtually all other load off the chunk and master servers
>>> seems to have improved this significantly. I still see timeouts from
>>> mfsmount, but not enough to be problematic.
>>>
>>> To try to optimize the performance I am experimenting with accessing
>>> the data using different APIs and block sizes, but this has been
>>> inconclusive. I have tried the effect of posix_fadvise(), sendfile()
>>> and different sized buffers for read(). I still want to try mmap().
>>> sendfile() did seem to be slightly slower than read().
>>>
>>> Robert
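For reference, a sketch of the two code paths compared above. Error handling and partial writes are trimmed, and the 64 kB buffer matches the reads mentioned earlier:

/* Variant 1: read() into a user-space buffer, then write() to the
 * client socket, with a read-ahead hint. Variant 2: sendfile(), which
 * moves the data inside the kernel (Linux). */
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>

#define BUF_SZ (64 * 1024)

static void serve_read_write(int file_fd, int sock_fd, off_t filesize)
{
    char buf[BUF_SZ];
    ssize_t n;

    posix_fadvise(file_fd, 0, filesize, POSIX_FADV_SEQUENTIAL);  /* read-ahead hint */
    while ((n = read(file_fd, buf, sizeof(buf))) > 0)
        (void)write(sock_fd, buf, (size_t)n);
}

static void serve_sendfile(int file_fd, int sock_fd, off_t filesize)
{
    off_t off = 0;

    while (off < filesize)
        if (sendfile(sock_fd, file_fd, &off, (size_t)(filesize - off)) <= 0)
            break;
}

sendfile() avoids the user-space copy, but with the data coming through FUSE both variants still pay the same round trips to mfsmount, which might be why it made little difference here.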
>>> On 8/24/11 11:05 AM, Elliot Finley wrote:
>>> > On Tue, Aug 9, 2011 at 6:46 PM, Robert Sandilands <rsa...@ne...> wrote:
>>> >> Increasing the swap space fixed the fork() issue. It seems that
>>> >> you have to ensure that the memory available is always double the
>>> >> memory needed by mfsmaster. None of the swap space was used over
>>> >> the last 24 hours.
>>> >>
>>> >> This did solve the extreme comb-like behavior of mfsmaster. It
>>> >> still does not resolve its sensitivity to load on the server. I am
>>> >> still seeing timeouts on the chunkservers and mounts on the hour,
>>> >> due to the high CPU and I/O load when the metadata is dumped to
>>> >> disk. It did, however, decrease significantly.
>>> > Here is another thought on this...
>>> >
>>> > The process is niced to -19 (very high priority) so that it has
>>> > good performance. It forks once per hour to write out the metadata.
>>> > I haven't checked the code for this, but is the forked process
>>> > lowering its priority so it doesn't compete with the original
>>> > process?
>>> >
>>> > If it's not, it should be an easy code change to lower the priority
>>> > in the child process (metadata writer) so that it doesn't compete
>>> > with the original process at the same priority.
>>> >
>>> > If you check into this, I'm sure the list would appreciate an
>>> > update. :)
>>> >
>>> > Elliot
>>>
>>> --
>>> - Davies
>>
>> --
>> - Davies
>
> --
> - Davies
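PS: On Elliot's suggestion about the metadata writer, a minimal sketch of the kind of change he describes, not the actual mfsmaster code; dump_metadata() is a hypothetical stand-in for the real dump routine:

/* After fork(), the child that dumps the metadata drops its priority
 * so it no longer competes with the nice -19 parent. */
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

static void dump_metadata(void)
{
    /* ... serialize the in-memory metadata and write metadata.mfs ... */
}

static void fork_metadata_writer(void)
{
    pid_t pid = fork();

    if (pid == 0) {
        /* child: lower priority (nice 19) before the heavy CPU and
         * disk work of writing out the metadata */
        setpriority(PRIO_PROCESS, 0, 19);
        dump_metadata();
        _exit(0);
    }
    /* parent (pid > 0) keeps serving requests at its original
     * priority; pid < 0 means fork() failed, e.g. not enough
     * memory/swap for the copy-on-write child. */
}

On Linux it might also be worth giving the child a lower I/O priority (ionice / the idle scheduling class), since the hourly stall looks as much I/O-bound as CPU-bound.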