From: 陈钢 <yik...@gm...> - 2012-03-06 10:36:27
I'm in serious trouble -- my boss will kill me. On the mfsmaster machine I ran "mfsmaster restart", and it crashed. The error output is:

======
working directory: /var/lib/mfs
lockfile created and locked
initializing mfsmaster modules ...
loading sessions ... ok
sessions file has been loaded
exports file has been loaded
loading metadata ...
loading objects (files,directories,etc.) ... loading node: read error: ENOENT (No such file or directory)
error
init: file system manager failed !!!
error occured during initialization - exiting
=======

No "metadata.mfs.back" file was left for me. I think this happened because there was no space left on the mfsmaster's hard disk. I then logged in to the mfsmetalogger server and ran "mfsmetarestore -a -d /var/lib/mfs", which also reported an error:

======
file 'metadata.mfs.back' not found - will try 'metadata_ml.mfs.back' instead
loading objects (files,directories,etc.) ... loading node: read error: ENOENT (No such file or directory)
error
can't read metadata from file: .//metadata_ml.mfs.back
=======

What can I do now? PLEASE SAVE MY LIFE.....
From: Michał B. <mic...@ge...> - 2012-03-06 10:22:36
Hi!

I'm curious whether you have run any tests with your solution. What was the performance gain -- on the order of 20-30%, or more like 2-3%? Frankly, we are quite skeptical about this.

In one of your emails you suggest doing fsync more rarely (e.g. every 30 seconds). When the CS is closed cleanly, the OS completes all fsyncs before closing the files; but when the CS is not closed cleanly, how would we know which files to test? On startup the CS does not 'stat' every file (that would take too long), so it would not know which files to check. We could create an extra file (e.g. named '.dirty') where we save the id of each chunk file when it is opened (and fsync the '.dirty' file itself). When a chunk file is closed cleanly, we delete its id from '.dirty'. If the CS shuts down cleanly, '.dirty' should be empty; if not, the CS reads '.dirty' on startup and scans all the chunks listed in it.

You also suggested using these options:
1. FLUSH_ON_WRITE - easy to implement, but not that secure
2. FLUSH_DELAY - as above
3. CHECKSUM_INITIAL - this would mean reading all chunks on all disks at startup, which is simply impossible (in some environments it would take more than 24 hours)

We are still afraid of a scenario in which a CS malfunction without fsyncs leaves a chunk that "returns" to a form that is proper in the CRC sense but predates the save. That would mean several "proper" copies of the same chunk with different contents -- we cannot allow that to happen. It would probably also be necessary to inform the master server that the CS has some unfsynced chunks, so this gets more complicated still.

That's why we are curious whether the performance gain is substantial enough to justify this fine tuning.

Kind regards
Michał Borychowski
MooseFS Support Manager
From: Chris P. <ch...@ec...> - 2012-03-05 09:32:00
On Mon, 2012-03-05 at 17:20 +0800, Davies Liu wrote:
> Hi Chris!
>
> close(c->fd) is delayed in hdd_delayed_ops(), OPENDELAY seconds after
> hdd_io_end() is called. It reduces the open/close calls when many
> clients want to read/write the same chunk.
>
> You can #define OPENSTEPS as 0 to disable this feature.
>
> So, fsync() should be moved into hdd_delayed_ops() before close(), or
> run in hdd_io_end() when OPENSTEPS is defined as 0.

Thanks. That makes sense -- I will do some testing with that patch.

Chris

> 2012/3/5 Chris Picton <ch...@ec...>
>> Hi
>>
>> Even with that patch, the close(c->fd) in hdd_io_end is never run, as
>> the test OPENSTEPS==0 is never true. This is because the value of
>> OPENSTEPS never changes.
>>
>> The following patch logic makes more sense to me (ignoring the
>> current fsync location). It does, however, change when close() is
>> called, as currently the logic to close the fd in hdd_io_end is never
>> reached.
>>
>> diff -uNr mfs-1.6.20.orig/mfschunkserver/hddspacemgr.c mfs-1.6.20.opensteps/mfschunkserver/hddspacemgr.c
>> --- mfs-1.6.20.orig/mfschunkserver/hddspacemgr.c	2011-01-10 13:34:22.000000000 +0200
>> +++ mfs-1.6.20.opensteps/mfschunkserver/hddspacemgr.c	2012-03-05 11:08:13.983795552 +0200
>> @@ -758,7 +758,7 @@
>>  	c->filename = NULL;
>>  	c->blocks = 0;
>>  	c->crcrefcount = 0;
>> -	c->opensteps = 0;
>> +	c->opensteps = OPENSTEPS;
>>  	c->crcsteps = 0;
>>  	c->crcchanged = 0;
>>  	c->fd = -1;
>> @@ -831,7 +831,7 @@
>>  	c->filename = NULL;
>>  	c->blocks = 0;
>>  	c->crcrefcount = 0;
>> -	c->opensteps = 0;
>> +	c->opensteps = OPENSTEPS;
>>  	c->crcsteps = 0;
>>  	c->crcchanged = 0;
>>  	c->fd = -1;
>> @@ -1659,15 +1659,15 @@
>>  	}
>>  	c->crcrefcount--;
>>  	if (c->crcrefcount==0) {
>> -		if (OPENSTEPS==0) {
>> -			if (close(c->fd)<0) {
>> +		if (c->opensteps==0) {
>> +			if (close(c->fd)<0) {	// close descriptor
>>  				c->fd = -1;
>>  				mfs_arg_errlog_silent(LOG_WARNING,"hdd_io_end: file:%s - close error",c->filename);
>>  				return ERROR_IO;
>>  			}
>>  			c->fd = -1;
>>  		} else {
>> -			c->opensteps = OPENSTEPS;
>> +			c->opensteps--;	// decrease opensteps
>>  		}
>>  		c->crcsteps = CRCSTEPS;
>>  #ifdef PRESERVE_BLOCK

[Earlier quoted messages snipped here: Davies' patch of 2012-03-03 and the older thread are quoted in full in the messages below.]

--
Chris Picton
Executive Manager - Systems
ECN Telecommunications (Pty) Ltd
t: 010 590 0031  m: 079 721 8521
f: 087 941 0813
e: ch...@ec...

"Lowering the cost of doing business"
From: Davies L. <dav...@gm...> - 2012-03-03 02:26:00
new patch:

--- mfs-1.6.26/mfschunkserver/hddspacemgr.c	2012-02-28 16:18:26.000000000 +0800
+++ mfs-1.6.26-r4/mfschunkserver/hddspacemgr.c	2012-03-02 20:33:29.000000000 +0800
@@ -1691,9 +1691,16 @@
 	}
 }
 
+static inline uint64_t get_usectime() {
+	struct timeval tv;
+	gettimeofday(&tv,NULL);
+	return ((uint64_t)(tv.tv_sec))*1000000+tv.tv_usec;
+}
+
 void hdd_delayed_ops() {
 	dopchunk **ccp,*cc,*tcc;
 	uint32_t dhashpos;
+	uint64_t ts,te;
 	chunk *c;
 //	int status;
 //	printf("delayed ops: before lock\n");
@@ -1756,6 +1763,22 @@
 			if (c->opensteps>0) {	// decrease counter
 				c->opensteps--;
 			} else if (c->fd>=0) {	// close descriptor
+				ts = get_usectime();
+#ifdef F_FULLFSYNC
+				if (fcntl(c->fd,F_FULLFSYNC)<0) {
+					int errmem = errno;
+					mfs_arg_errlog_silent(LOG_WARNING,"hdd_delayed_ops: file:%s - fsync (via fcntl) error",c->filename);
+					errno = errmem;
+				}
+#else
+				if (fsync(c->fd)<0) {
+					int errmem = errno;
+					mfs_arg_errlog_silent(LOG_WARNING,"hdd_delayed_ops: file:%s - fsync (direct call) error",c->filename);
+					errno = errmem;
+				}
+#endif
+				te = get_usectime();
+				hdd_stats_datafsync(c->owner,te-ts);
 				if (close(c->fd)<0) {
 					hdd_error_occured(c);	// uses and preserves errno !!!
 					mfs_arg_errlog_silent(LOG_WARNING,"hdd_delayed_ops: file:%s - close error",c->filename);
@@ -1792,12 +1815,6 @@
 //	printf("delayed ops: after unlock\n");
 }
 
-static inline uint64_t get_usectime() {
-	struct timeval tv;
-	gettimeofday(&tv,NULL);
-	return ((uint64_t)(tv.tv_sec))*1000000+tv.tv_usec;
-}
-
 static int hdd_io_begin(chunk *c,int newflag) {
 	dopchunk *cc;
 	int status;
@@ -1891,28 +1908,27 @@
 			errno = errmem;
 			return status;
 		}
-		ts = get_usectime();
-#ifdef F_FULLFSYNC
-		if (fcntl(c->fd,F_FULLFSYNC)<0) {
-			int errmem = errno;
-			mfs_arg_errlog_silent(LOG_WARNING,"hdd_io_end: file:%s - fsync (via fcntl) error",c->filename);
-			errno = errmem;
-			return ERROR_IO;
-		}
-#else
-		if (fsync(c->fd)<0) {
-			int errmem = errno;
-			mfs_arg_errlog_silent(LOG_WARNING,"hdd_io_end: file:%s - fsync (direct call) error",c->filename);
-			errno = errmem;
-			return ERROR_IO;
-		}
-#endif
-		te = get_usectime();
-		hdd_stats_datafsync(c->owner,te-ts);
 	}
 	c->crcrefcount--;
 	if (c->crcrefcount==0) {
 		if (OPENSTEPS==0) {
+			ts = get_usectime();
+#ifdef F_FULLFSYNC
+			if (fcntl(c->fd,F_FULLFSYNC)<0) {
+				int errmem = errno;
+				mfs_arg_errlog_silent(LOG_WARNING,"hdd_io_end: file:%s - fsync (via fcntl) error",c->filename);
+				errno = errmem;
+			}
+#else
+			if (fsync(c->fd)<0) {
+				int errmem = errno;
+				mfs_arg_errlog_silent(LOG_WARNING,"hdd_io_end: file:%s - fsync (direct call) error",c->filename);
+				errno = errmem;
+			}
+#endif
+			te = get_usectime();
+			hdd_stats_datafsync(c->owner,te-ts);
+
 			if (close(c->fd)<0) {
 				int errmem = errno;
 				c->fd = -1;
@@ -3766,6 +3782,7 @@
 //	}
 	prevf = NULL;
 	c = hdd_chunk_get(chunkid,CH_NEW_AUTO);
+	if (c == NULL) return;
 	if (c->filename!=NULL) {	// already have this chunk
 		if (version <= c->version) {	// current chunk is older
 			if (todel<2) {	// this is R/W fs?
2012/3/1 Chris Picton <ch...@ec...>
> In mfs-1.6.20, the OPENSTEPS vs fd closing logic appears a bit flawed.
>
> The test is made in hdd_io_end:
>
>     if (OPENSTEPS==0) {
>
> However, OPENSTEPS is always > 0, as it is initialised once at the top
> of the file with a #define. This means the file descriptors never get
> closed (by that logic).

They are closed in hdd_delayed_ops() with a 5-second delay. So I move
the fsync() into hdd_delayed_ops(), just before close(). An fsync() in
hdd_term() before close() is also needed (not in the patch).

> Am I reading the code correctly?
>
> Chris
>
> On Thu, 2012-03-01 at 16:51 +0800, Davies Liu wrote:
>> I cannot figure out how to make the fsync frequency configurable, so
>> instead I move fsync() to just before close():
>>
>> --- mfs-1.6.26/mfschunkserver/hddspacemgr.c	2012-02-08 16:15:03.000000000 +0800
>> +++ mfs-1.6.26-r1/mfschunkserver/hddspacemgr.c	2012-03-01 16:17:23.000000000 +0800
>> @@ -1887,28 +1887,28 @@
>>  			errno = errmem;
>>  			return status;
>>  		}
>> -		ts = get_usectime();
>> -#ifdef F_FULLFSYNC
>> -		if (fcntl(c->fd,F_FULLFSYNC)<0) {
>> -			int errmem = errno;
>> -			mfs_arg_errlog_silent(LOG_WARNING,"hdd_io_end: file:%s - fsync (via fcntl) error",c->filename);
>> -			errno = errmem;
>> -			return ERROR_IO;
>> -		}
>> -#else
>> -		if (fsync(c->fd)<0) {
>> -			int errmem = errno;
>> -			mfs_arg_errlog_silent(LOG_WARNING,"hdd_io_end: file:%s - fsync (direct call) error",c->filename);
>> -			errno = errmem;
>> -			return ERROR_IO;
>> -		}
>> -#endif
>> -		te = get_usectime();
>> -		hdd_stats_datafsync(c->owner,te-ts);
>>  	}
>>  	c->crcrefcount--;
>>  	if (c->crcrefcount==0) {
>>  		if (OPENSTEPS==0) {
>> +			ts = get_usectime();
>> +#ifdef F_FULLFSYNC
>> +			if (fcntl(c->fd,F_FULLFSYNC)<0) {
>> +				int errmem = errno;
>> +				mfs_arg_errlog_silent(LOG_WARNING,"hdd_io_end: file:%s - fsync (via fcntl) error",c->filename);
>> +				errno = errmem;
>> +				return ERROR_IO;
>> +			}
>> +#else
>> +			if (fsync(c->fd)<0) {
>> +				int errmem = errno;
>> +				mfs_arg_errlog_silent(LOG_WARNING,"hdd_io_end: file:%s - fsync (direct call) error",c->filename);
>> +				errno = errmem;
>> +				return ERROR_IO;
>> +			}
>> +#endif
>> +			te = get_usectime();
>> +			hdd_stats_datafsync(c->owner,te-ts);
>>  			if (close(c->fd)<0) {
>>  				int errmem = errno;
>>  				c->fd = -1;
>>
>> On Thu, Mar 1, 2012 at 12:21 PM, Chris Picton <ch...@ec...> wrote:
>>> I have had similar ideas.
>>>
>>> I currently have a patch to disable fsync on every block close;
>>> however, this will probably lead to data corruption if there is a
>>> site-wide power outage.
>>>
>>> My thoughts are as follows:
>>>  * Create config variable FLUSH_ON_WRITE (0/1) to disable or enable
>>>    flush on write.
>>>  * Create config variable FLUSH_DELAY (seconds) to prevent flushing
>>>    immediately - rather, pointers to the written chunks would be
>>>    stored and looped through (in a separate thread?) to flush any
>>>    which are older than the delay. This would ensure that the chunks
>>>    have a maximum time during which they may potentially be invalid
>>>    on disk. If FLUSH_DELAY is 0, behaviour is as current.
>>>  * Create config variable CHECKSUM_INITIAL (0/1). If set to 1 it
>>>    would force a checksum of *all* blocks on a chunkserver at
>>>    startup, to find potentially bad chunks before they are used. Is
>>>    this necessary, though? Are checksums read on every block read?
>>>
>>> I may start making the above patches, if I get time to do so.
>>>
>>> Chris
>>>
>>> On 2012/03/01 6:09 AM, Davies Liu wrote:
>>>> Hi Michal!
>>>>
>>>> I have found the reason for the bad write performance: the chunk
>>>> server writes the crc block to disk EVERY second, then fsync()s,
>>>> which takes about 28ms.
>>>>
>>>> Can I reduce the frequency of fsync() to every minute, then check
>>>> the chunks modified in the most recent 30 minutes after booting?
>>>>
>>>> Davies
>>>>
>>>> 2012/2/23 Michał Borychowski <mic...@ge...>
>>>>> Hi Davies!
>>>>>
>>>>> Here is our analysis of this situation. Different files are
>>>>> written simultaneously on the same CS - that's why pwrites go to
>>>>> different files. A block size of 64kB is not that small. Writes in
>>>>> the OS go through the write cache, so all saves that are a
>>>>> multiple of 4096B should work equally fast.
>>>>>
>>>>> Our tests:
>>>>>
>>>>> dd on Linux 64k : 640k
>>>>> $ dd if=/dev/zero of=/tmp/test bs=64k count=10000
>>>>> 10000+0 records in
>>>>> 10000+0 records out
>>>>> 655360000 bytes (655 MB) copied, 22.1836 s, 29.5 MB/s
>>>>>
>>>>> $ dd if=/dev/zero of=/tmp/test bs=640k count=1000
>>>>> 1000+0 records in
>>>>> 1000+0 records out
>>>>> 655360000 bytes (655 MB) copied, 23.1311 s, 28.3 MB/s
>>>>>
>>>>> dd on Mac OS X 64k : 640k
>>>>> $ dd if=/dev/zero of=/tmp/test bs=64k count=10000
>>>>> 10000+0 records in
>>>>> 10000+0 records out
>>>>> 655360000 bytes transferred in 14.874652 secs (44058846 bytes/sec)
>>>>>
>>>>> $ dd if=/dev/zero of=/tmp/test bs=640k count=1000
>>>>> 1000+0 records in
>>>>> 1000+0 records out
>>>>> 655360000 bytes transferred in 14.578427 secs (44954096 bytes/sec)
>>>>>
>>>>> So the times are similar. Saves going to different files also
>>>>> should not be a problem, as the kernel scheduler takes care of
>>>>> this.
>>>>>
>>>>> If you have some specific idea how to improve the saves, please
>>>>> share it with us.
>>>>>
>>>>> Kind regards
>>>>> Michał
>>>>>
>>>>> -----Original Message-----
>>>>> From: Davies Liu [mailto:dav...@gm...]
>>>>> Sent: Wednesday, February 22, 2012 8:24 AM
>>>>> To: moo...@li...
>>>>> Subject: [Moosefs-users] Bad write performance of mfschunkserver
>>>>>
>>>>> Hi, devs:
>>>>>
>>>>> Today we found that some mfschunkservers were not responsive,
>>>>> which caused many timeouts in mfsmount; then all write operations
>>>>> were blocked.
>>>>>
>>>>> After some digging, we found some small but continuous write
>>>>> bandwidth; strace shows many small pwrite() calls spread across
>>>>> several files:
>>>>>
>>>>> [pid 7087] 12:28:28 pwrite(19, "baggins3 60.210.18.235 sE7NtNQU7"..., 25995, 55684725 <unfinished ...>
>>>>> [pid 7078] 12:28:28 pwrite(17, "2012/02/22 12:28:28:root: WARNIN"..., 69, 21768909 <unfinished ...>
>>>>> [pid 7080] 12:28:28 pwrite(20, "gardner4 183.7.50.169 mr5vi+Z4H3"..., 47663, 34550257 <unfinished ...>
>>>>> [pid 7079] 12:28:28 pwrite(19, "\" \"Mozilla/5.0 (Windows NT 6.1) "..., 40377, 55710720 <unfinished ...>
>>>>> [pid 7086] 12:28:28 pwrite(23, "MATP; InfoPath.2; .NET4.0C; 360S"..., 65536, 6427648 <unfinished ...>
>>>>> [pid 7082] 12:28:28 pwrite(23, "; GTB7.2; SLCC2; .NET CLR 2.0.50"..., 65536, 6493184 <unfinished ...>
>>>>> [pid 7083] 12:28:28 pwrite(20, "\255BYU\355\237\347\226s\261\307N{A\355\203S\306\244\255\322[\322\rJ\32[z3\31\311\327"..., 4096, 1024 <unfinished ...>
>>>>> [pid 7078] 12:28:28 pwrite(23, "ovie/subject/4724373/reviews?sta"..., 65536, 6558720 <unfinished ...>
>>>>> [pid 7080] 12:28:28 pwrite(19, "[\"[\4\5\266v\324\366\245n\t\315\202\227\\\343=\336-\r k)\316\354\335\353\373\340\331;"..., 4096, 1024 <unfinished ...>
>>>>> [pid 7079] 12:28:28 pwrite(23, "ta-Python/2.0.15\" 0.016\n211.147."..., 65536, 6624256 <unfinished ...>
>>>>> [pid 7081] 12:28:28 pwrite(23, "4034093?apikey=0eb695f25995d7eb2"..., 65536, 6689792 <unfinished ...>
>>>>> [pid 7084] 12:28:28 pwrite(23, " y8G23n95BKY:43534427:wind8vssc4"..., 65536, 6755328) = 65536 <0.000108>
>>>>> [pid 7078] 12:28:28 pwrite(23, "TkVvKuXfug:3248233:5Yo9vFoOIuo \""..., 65536, 6820864 <unfinished ...>
>>>>> [pid 7086] 12:28:28 pwrite(23, ":s|1563396:s|1040897:s|1395290:s"..., 65536, 6886400 <unfinished ...>
>>>>> [pid 7085] 12:28:28 pwrite(23, "dows%3B%20U%3B%
> > > 20Windows%20NT%20"..., > > > 65536, 6951936 <unfinished ...> > > > [pid 7087] 12:28:28 pwrite(23, "/533.17.9 (KHTML, > > > like Gecko) Ve"..., 65536, 7017472 <unfinished ...> > > > [pid 7079] 12:28:28 pwrite(23, " r1m+tFW1T5M:: > > > \"22/Feb/2012:00:0"..., 65536, 7083008 > > > <unfinished ...> [pid 7086] 12:28:28 pwrite(19, > > > "baggins5 61.174.60.117 i6MSCBvE1"..., 25159, > > > 55751097 <unfinished ...> [pid 7084] 12:28:28 > > > pwrite(20, "gardner1 182.118.7.64 TjxzPKdqNU"..., > > > 10208, 34597920 <unfinished ...> [pid 7080] > > > 12:28:28 pwrite(23, "d7eb2c23c1d70cc187c1&alt=json > > > HT"..., 65536, 7148544 <unfinished ...> [pid 7083] > > > 12:28:28 pwrite(23, > > > "5_Google&type=n&channel=-3&user_"..., > > > 65536, 7214080 <unfinished ...> > > > [pid 7085] 12:28:28 pwrite(19, "12-02-22 12:28:27 > > > 1861 \"GET /ser"..., 23179, 55776256 > > > <unfinished ...> [pid 7082] 12:28:28 pwrite(23, > > > "\"http://douban.fm/swf/53035/fmpl"..., 65536, > > > 7279616 <unfinished ...> [pid 7078] 12:28:28 > > > pwrite(20, "opic/27639291/add_comment HTTP/1"..., > > > 18576, 34608128 <unfinished ...> [pid 7087] > > > 12:28:28 pwrite(19, "[\"[\4\5\266v\324\366\245n\t > > > \315\202\227\\\343=\336-\r > > > k)\316\354\335\353\373\340\331;"..., 4096, 1024 > > > <unfinished ...> [pid 7079] 12:28:28 pwrite(23, > > > "ww.douban.com%2Fgroup%2Ftopic%2F"..., > > > 65536, 7345152 <unfinished ...> > > > [pid 7081] 12:28:28 pwrite(20, > > > "\255BYU\355\237\347\226s\261\307N{A\355\203S\306 > > > \244\255\322[\322\rJ\32[z3\31\311\327"..., > > > 4096, 1024 <unfinished ...> > > > [pid 7086] 12:28:28 pwrite(23, "patible; MSIE 7.0; > > > Windows NT 6."..., 65536, 7410688 <unfinished ...> > > > [pid 7084] 12:28:28 pwrite(23, "fari/535.7 360EE\" > > > 0.006\n211.147."..., 65536, 7476224 <unfinished ...> > > > [pid 7080] 12:28:28 pwrite(23, "1:OUIVR8CIG5c > > > \"22/Feb/2012:00:03"..., 65536, 7541760 > > > <unfinished ...> [pid 7085] 12:28:28 pwrite(23, "fm > > > \"GET 
/j/mine/playlist?type=s&"..., 65536, 7607296 > > > <unfinished ...> [pid 7083] 12:28:28 pwrite(23, > > > "pe=n&channel=18&user_id=39266798"..., > > > 65536, 7672832 <unfinished ...> > > > [pid 7082] 12:28:28 pwrite(23, " 0.023 > > > \n125.34.190.128 :: > > > \"22/Feb"..., 65536, 7738368 <unfinished ...> [pid > > > 7078] 12:28:28 pwrite(23, "00 5859 > > > \"http://www.douban.com/p"..., 65536, 7803904 > > > <unfinished ...> [pid 7079] 12:28:28 pwrite(23, > > > "03:08 +0800\" www.douban.com \"GET"..., 65536, > > > 7869440 <unfinished ...> [pid 7086] 12:28:28 > > > pwrite(23, "type=all HTTP/1.1\" 200 1492 \"-\" > > > "..., 65536, 7934976 <unfinished ...> > > > [pid 7084] 12:28:28 pwrite(23, > > > "Hiapk&user_id=57982902&expire=13"..., > > > 65536, 8000512 <unfinished ...> > > > [pid 7080] 12:28:28 pwrite(23, "0.011 > > > \n116.253.89.216 rxASuWZf1wg"..., 65536, 8066048 > > > <unfinished ...> [pid 7085] 12:28:28 pwrite(23, "9 > > > +0800\" www.douban.com \"GET /ph"..., 65536, > > > 8131584) = 65536 <0.000062> [pid 7083] 12:28:28 > > > pwrite(23, " +0800\" www.douban.com \"GET /eve"..., > > > 65536, 8197120 <unfinished ...> [pid 7082] 12:28:28 > > > pwrite(23, " +0800\" www.douban.com \"POST /se"..., > > > 65536, 8262656) = 65536 <0.000103> [pid 7087] > > > 12:28:28 pwrite(23, "0 12971 > > > \"http://www.douban.com/g"..., 65536, 8328192 > > > <unfinished ...> [pid 7081] 12:28:28 pwrite(23, ".0 > > > (compatible; MSIE 7.0; Window"..., 65536, 8393728) = > > > 65536 <0.000065> > > > > > > In order to get better performance, the chunk server > > > should merge the continuous sequential write > > > operations into larger ones. > > > > > > -- > > > - Davies > > > > > > > > > > ------------------------------------------------------------------------------ > > > Virtualization & Cloud Management Using Capacity > > > Planning Cloud computing makes use of virtualization > > > - but cloud computing also focuses on allowing > > > computing to be delivered as a service. 
> > > http://www.accelacomm.com/jaw/sfnl/114/51521223/ > > > _______________________________________________ > > > moosefs-users mailing list > > > moo...@li... > > > > https://lists.sourceforge.net/lists/listinfo/moosefs-users > > > > > > > > > > > > > > > > > > -- > > > - Davies > > > > > > > > > > > > > ------------------------------------------------------------------------------ > > > Virtualization & Cloud Management Using Capacity Planning > > > Cloud computing makes use of virtualization - but cloud > computing > > > also focuses on allowing computing to be delivered as a > service. > > > http://www.accelacomm.com/jaw/sfnl/114/51521223/ > > > > > > > > > _______________________________________________ > > > moosefs-users mailing list > > > moo...@li... > > > https://lists.sourceforge.net/lists/listinfo/moosefs-users > > > > > > > > > > > > > > -- > > - Davies > > > > -- > Chris Picton > > Executive Manager - Systems > ECN Telecommunications (Pty) Ltd > t: 010 590 0031 m: 079 721 8521 > f: 087 941 0813 > e: ch...@ec... > > "Lowering the cost of doing business" > > > -- - Davies |
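Davies's reply above — defer the fsync() so it happens once, immediately before the delayed close() — can be sketched as a small portable helper. This is an illustrative sketch only, not MooseFS code: the name `safe_sync_close()` is invented here. It mirrors the `#ifdef F_FULLFSYNC` fallback used in the patch (on Mac OS X, `fcntl(fd, F_FULLFSYNC)` also flushes the drive's write cache, which a plain `fsync()` may not).

```c
/* Sketch of the "sync once, just before close" pattern discussed above.
 * safe_sync_close() is a hypothetical name, not from the MooseFS source. */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

static int safe_sync_close(int fd)
{
    /* Flush file data to stable storage before giving up the descriptor. */
#ifdef F_FULLFSYNC
    /* Darwin: F_FULLFSYNC pushes data past the drive's volatile cache. */
    if (fcntl(fd, F_FULLFSYNC) < 0) {
        int errmem = errno;
        close(fd);              /* still release the descriptor */
        errno = errmem;
        return -1;
    }
#else
    if (fsync(fd) < 0) {
        int errmem = errno;
        close(fd);
        errno = errmem;
        return -1;
    }
#endif
    return close(fd);
}
```

Keeping the sync adjacent to the close means each descriptor pays the fsync cost once per open/close cycle instead of once per I/O batch, which is the whole point of moving it into hdd_delayed_ops().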
From: Wang J. <jia...@re...> - 2012-03-01 17:08:07
|
On 2012/3/1 20:35, Steve Thompson wrote: > On Tue, 28 Feb 2012, Ricardo J. Barberis wrote: > >> This happened to me once and I also took down every server of the cluster, 9 >> chunkservers and one dedicated metalogger (previously, I unmounted all the >> clients, about 250). >> >> Bad idea: when the master came on-line again and I started one chunkserver, >> the master went "crazy" trying to recreate empty chunks for later deletion. >> >> My "solution" was to start all the chunkservers at the same time, so the >> master saw all the chunks almost simultaneously and didn't try to create >> empty chunks. > It turns out that the master was very very slow because a RAID-5 > reconstruction was in progress on the box. The I/O performance dropped to > 5% of its normal value, in spite of the RAID controller throttle being > set to 30% maximum reconstruction rate (it's a Dell PE2900 server with a > Perc 5 controller). Once the reconstruction finished, I restarted > everything (all chunk servers at the same time) and it came up fine within > a few seconds. > > Steve Use SSD and RAID10 whenever possible for "meta" servers nowadays; IO load during recovery should always be taken into consideration. |
From: Chris P. <ch...@ec...> - 2012-03-01 13:09:50
|
In mfs-1.6.20, the OPENSTEPS vs fd closing logic appears a bit flawed The test is made in hdd_io_end: if (OPENSTEPS==0) { However, OPENSTEPS always is > 0, as it is initialised once at the top of the file, with a #define This means the file descriptors never get closed (by that logic) Am I reading the code correctly? Chris On Thu, 2012-03-01 at 16:51 +0800, Davies Liu wrote: > I can not figure it out how to make fsync frequency configurable, then > move fsync() just before close(): > > > --- mfs-1.6.26/mfschunkserver/hddspacemgr.c 2012-02-08 > 16:15:03.000000000 +0800 > +++ mfs-1.6.26-r1/mfschunkserver/hddspacemgr.c 2012-03-01 > 16:17:23.000000000 +0800 > @@ -1887,28 +1887,28 @@ > errno = errmem; > return status; > } > - ts = get_usectime(); > -#ifdef F_FULLFSYNC > - if (fcntl(c->fd,F_FULLFSYNC)<0) { > - int errmem = errno; > - mfs_arg_errlog_silent(LOG_WARNING,"hdd_io_end: > file:%s - fsync (via fcntl) error",c->filename); > - errno = errmem; > - return ERROR_IO; > - } > -#else > - if (fsync(c->fd)<0) { > - int errmem = errno; > - mfs_arg_errlog_silent(LOG_WARNING,"hdd_io_end: > file:%s - fsync (direct call) error",c->filename); > - errno = errmem; > - return ERROR_IO; > - } > -#endif > - te = get_usectime(); > - hdd_stats_datafsync(c->owner,te-ts); > } > c->crcrefcount--; > if (c->crcrefcount==0) { > if (OPENSTEPS==0) { > + ts = get_usectime(); > +#ifdef F_FULLFSYNC > + if (fcntl(c->fd,F_FULLFSYNC)<0) { > + int errmem = errno; > + > mfs_arg_errlog_silent(LOG_WARNING,"hdd_io_end: file:%s - fsync (via > fcntl) error",c->filename); > + errno = errmem; > + return ERROR_IO; > + } > +#else > + if (fsync(c->fd)<0) { > + int errmem = errno; > + > mfs_arg_errlog_silent(LOG_WARNING,"hdd_io_end: file:%s - fsync (direct > call) error",c->filename); > + errno = errmem; > + return ERROR_IO; > + } > +#endif > + te = get_usectime(); > + hdd_stats_datafsync(c->owner,te-ts); > if (close(c->fd)<0) { > int errmem = errno; > c->fd = -1; > > On Thu, Mar 1, 2012 at 12:21 PM, Chris 
Picton <ch...@ec...> > wrote: > I have had similar ideas > > I currently have a patch to disable fsync on every block > close, however this will probably lead to data corruption if > there is a site-wide power outage. > > My thoughts are as follows: > * Create config variable FLUSH_ON_WRITE (0/1) to disable > or enable flush on write > * Create config variable FLUSH_DELAY (seconds) to > prevent flushing immediately - rather the pointers to > the written chunks would be stored, and looped through > (in a separate thread?), to flush any which are older > than the delay. This would ensure that the chunks > have a maximum time during which they may potentially > be invalid on disk. If the FLUSH_DELAY is 0, then > behaviour is as current > * Create config variable CHECKSUM_INITIAL (0/1) If set > to 1 would force a checksum of *all* blocks on a > chunkserver at startup, to find potentially bad chunks > before they are used. Is this necessary, though? Are > checksums read on every block read? > I may start making the above patches, if I get time to do so. > > Chris > > > > > On 2012/03/01 6:09 AM, Davies Liu wrote: > > Hi Michal! > > > > > > I have found the reason for bad write performance, the chunk > > server will write the crc block > > into disk EVERY second, then fsync(), which will take about > > 28ms. > > > > Can I reduce frequency of fsync() to every minute, then > > check the chunk modified in recent > > 30 minutes after booting? > > > > Davies > > > > 2012/2/23 Michał Borychowski <mic...@ge...> > > Hi Davies! > > > > Here is our analysis of this situation. Different > > files are written simultaneously on the same CS - > > that's why pwrites are written do different files. > > Block size of 64kB is not that small. Writes in OS > > are sent through write cache so all saves being > > multiplication of 4096B should work equally fast. 
> > > > Our tests: > > > > dd on Linux 64k : 640k > > $ dd if=/dev/zero of=/tmp/test bs=64k count=10000 > > 10000+0 records in > > 10000+0 records out > > 655360000 bytes (655 MB) copied, 22.1836 s, 29.5 > > MB/s > > > > $ dd if=/dev/zero of=/tmp/test bs=640k count=1000 > > 1000+0 records in > > 1000+0 records out > > 655360000 bytes (655 MB) copied, 23.1311 s, 28.3 > > MB/s > > > > dd on Mac OS X 64k : 640k > > $ dd if=/dev/zero of=/tmp/test bs=64k count=10000 > > 10000+0 records in > > 10000+0 records out > > 655360000 bytes transferred in 14.874652 secs > > (44058846 bytes/sec) > > > > $ dd if=/dev/zero of=/tmp/test bs=640k count=1000 > > 1000+0 records in > > 1000+0 records out > > 655360000 bytes transferred in 14.578427 secs > > (44954096 bytes/sec) > > > > So the times are similar. Saves going to different > > files also should not be a problem as a kernel > > scheduler takes care of this. > > > > If you have some specific idea how to improve the > > saves please share it with us. > > > > > > Kind regards > > Michał > > > > -----Original Message----- > > From: Davies Liu [mailto:dav...@gm...] > > Sent: Wednesday, February 22, 2012 8:24 AM > > To: moo...@li... > > Subject: [Moosefs-users] Bad write performance of > > mfschunkserver > > > > Hi,devs: > > > > Today, We found that some mfschunkserver were not > > responsive, caused many timeout in mfsmount, then > > all the write operation were blocked. 
> > > > After some digging, we found that there were some > > small but continuous write bandwidth, strace show > > that many small pwrite() between several files: > > > > [pid 7087] 12:28:28 pwrite(19, "baggins3 > > 60.210.18.235 sE7NtNQU7"..., 25995, 55684725 > > <unfinished ...> [pid 7078] 12:28:28 pwrite(17, > > "2012/02/22 12:28:28:root: WARNIN"..., 69, 21768909 > > <unfinished ...> [pid 7080] 12:28:28 pwrite(20, > > "gardner4 183.7.50.169 mr5vi+Z4H3"..., 47663, > > 34550257 <unfinished ...> [pid 7079] 12:28:28 > > pwrite(19, "\" \"Mozilla/5.0 (Windows NT 6.1) "..., > > 40377, 55710720 <unfinished ...> [pid 7086] > > 12:28:28 pwrite(23, "MATP; InfoPath.2; .NET4.0C; > > 360S"..., 65536, 6427648 <unfinished ...> [pid > > 7082] 12:28:28 pwrite(23, "; GTB7.2; SLCC2; .NET > > CLR 2.0.50"..., 65536, 6493184 <unfinished ...> > > [pid 7083] 12:28:28 pwrite(20, "\255BYU\355\237\347 > > \226s\261\307N{A\355\203S\306\244\255\322[\322\rJ > > \32[z3\31\311\327"..., > > 4096, 1024 <unfinished ...> > > [pid 7078] 12:28:28 pwrite(23, > > "ovie/subject/4724373/reviews?sta"..., > > 65536, 6558720 <unfinished ...> > > [pid 7080] 12:28:28 pwrite(19, > > "[\"[\4\5\266v\324\366\245n\t\315\202\227\\\343= > > \336-\r > > k)\316\354\335\353\373\340\331;"..., 4096, 1024 > > <unfinished ...> [pid 7079] 12:28:28 pwrite(23, > > "ta-Python/2.0.15\" > > 0.016\n211.147."..., 65536, 6624256 <unfinished ...> > > [pid 7081] 12:28:28 pwrite(23, > > "4034093?apikey=0eb695f25995d7eb2"..., > > 65536, 6689792 <unfinished ...> > > [pid 7084] 12:28:28 pwrite(23, " > > y8G23n95BKY:43534427:wind8vssc4"..., > > 65536, 6755328) = 65536 <0.000108> > > [pid 7078] 12:28:28 pwrite(23, > > "TkVvKuXfug:3248233:5Yo9vFoOIuo \""..., 65536, > > 6820864 <unfinished ...> [pid 7086] 12:28:28 > > pwrite(23, ":s|1563396:s|1040897:s|1395290:s"..., > > 65536, 6886400 <unfinished ...> > > [pid 7085] 12:28:28 pwrite(23, "dows%3B%20U%3B% > > 20Windows%20NT%20"..., > > 65536, 6951936 <unfinished ...> > > [pid 7087] 12:28:28 
pwrite(23, "/533.17.9 (KHTML, > > like Gecko) Ve"..., 65536, 7017472 <unfinished ...> > > [pid 7079] 12:28:28 pwrite(23, " r1m+tFW1T5M:: > > \"22/Feb/2012:00:0"..., 65536, 7083008 > > <unfinished ...> [pid 7086] 12:28:28 pwrite(19, > > "baggins5 61.174.60.117 i6MSCBvE1"..., 25159, > > 55751097 <unfinished ...> [pid 7084] 12:28:28 > > pwrite(20, "gardner1 182.118.7.64 TjxzPKdqNU"..., > > 10208, 34597920 <unfinished ...> [pid 7080] > > 12:28:28 pwrite(23, "d7eb2c23c1d70cc187c1&alt=json > > HT"..., 65536, 7148544 <unfinished ...> [pid 7083] > > 12:28:28 pwrite(23, > > "5_Google&type=n&channel=-3&user_"..., > > 65536, 7214080 <unfinished ...> > > [pid 7085] 12:28:28 pwrite(19, "12-02-22 12:28:27 > > 1861 \"GET /ser"..., 23179, 55776256 > > <unfinished ...> [pid 7082] 12:28:28 pwrite(23, > > "\"http://douban.fm/swf/53035/fmpl"..., 65536, > > 7279616 <unfinished ...> [pid 7078] 12:28:28 > > pwrite(20, "opic/27639291/add_comment HTTP/1"..., > > 18576, 34608128 <unfinished ...> [pid 7087] > > 12:28:28 pwrite(19, "[\"[\4\5\266v\324\366\245n\t > > \315\202\227\\\343=\336-\r > > k)\316\354\335\353\373\340\331;"..., 4096, 1024 > > <unfinished ...> [pid 7079] 12:28:28 pwrite(23, > > "ww.douban.com%2Fgroup%2Ftopic%2F"..., > > 65536, 7345152 <unfinished ...> > > [pid 7081] 12:28:28 pwrite(20, > > "\255BYU\355\237\347\226s\261\307N{A\355\203S\306 > > \244\255\322[\322\rJ\32[z3\31\311\327"..., > > 4096, 1024 <unfinished ...> > > [pid 7086] 12:28:28 pwrite(23, "patible; MSIE 7.0; > > Windows NT 6."..., 65536, 7410688 <unfinished ...> > > [pid 7084] 12:28:28 pwrite(23, "fari/535.7 360EE\" > > 0.006\n211.147."..., 65536, 7476224 <unfinished ...> > > [pid 7080] 12:28:28 pwrite(23, "1:OUIVR8CIG5c > > \"22/Feb/2012:00:03"..., 65536, 7541760 > > <unfinished ...> [pid 7085] 12:28:28 pwrite(23, "fm > > \"GET /j/mine/playlist?type=s&"..., 65536, 7607296 > > <unfinished ...> [pid 7083] 12:28:28 pwrite(23, > > "pe=n&channel=18&user_id=39266798"..., > > 65536, 7672832 <unfinished ...> > > [pid 
7082] 12:28:28 pwrite(23, " 0.023 > > \n125.34.190.128 :: > > \"22/Feb"..., 65536, 7738368 <unfinished ...> [pid > > 7078] 12:28:28 pwrite(23, "00 5859 > > \"http://www.douban.com/p"..., 65536, 7803904 > > <unfinished ...> [pid 7079] 12:28:28 pwrite(23, > > "03:08 +0800\" www.douban.com \"GET"..., 65536, > > 7869440 <unfinished ...> [pid 7086] 12:28:28 > > pwrite(23, "type=all HTTP/1.1\" 200 1492 \"-\" > > "..., 65536, 7934976 <unfinished ...> > > [pid 7084] 12:28:28 pwrite(23, > > "Hiapk&user_id=57982902&expire=13"..., > > 65536, 8000512 <unfinished ...> > > [pid 7080] 12:28:28 pwrite(23, "0.011 > > \n116.253.89.216 rxASuWZf1wg"..., 65536, 8066048 > > <unfinished ...> [pid 7085] 12:28:28 pwrite(23, "9 > > +0800\" www.douban.com \"GET /ph"..., 65536, > > 8131584) = 65536 <0.000062> [pid 7083] 12:28:28 > > pwrite(23, " +0800\" www.douban.com \"GET /eve"..., > > 65536, 8197120 <unfinished ...> [pid 7082] 12:28:28 > > pwrite(23, " +0800\" www.douban.com \"POST /se"..., > > 65536, 8262656) = 65536 <0.000103> [pid 7087] > > 12:28:28 pwrite(23, "0 12971 > > \"http://www.douban.com/g"..., 65536, 8328192 > > <unfinished ...> [pid 7081] 12:28:28 pwrite(23, ".0 > > (compatible; MSIE 7.0; Window"..., 65536, 8393728) = > > 65536 <0.000065> > > > > In order to get better performance, the chunk server > > should merge the continuous sequential write > > operations into larger ones. > > > > -- > > - Davies > > > > > > ------------------------------------------------------------------------------ > > Virtualization & Cloud Management Using Capacity > > Planning Cloud computing makes use of virtualization > > - but cloud computing also focuses on allowing > > computing to be delivered as a service. > > http://www.accelacomm.com/jaw/sfnl/114/51521223/ > > _______________________________________________ > > moosefs-users mailing list > > moo...@li... 
> > https://lists.sourceforge.net/lists/listinfo/moosefs-users > > > > > > > > > > > > -- > > - Davies > > > > > > > > ------------------------------------------------------------------------------ > > Virtualization & Cloud Management Using Capacity Planning > > Cloud computing makes use of virtualization - but cloud computing > > also focuses on allowing computing to be delivered as a service. > > http://www.accelacomm.com/jaw/sfnl/114/51521223/ > > > > > > _______________________________________________ > > moosefs-users mailing list > > moo...@li... > > https://lists.sourceforge.net/lists/listinfo/moosefs-users > > > > > > > -- > - Davies > -- Chris Picton Executive Manager - Systems ECN Telecommunications (Pty) Ltd t: 010 590 0031 m: 079 721 8521 f: 087 941 0813 e: ch...@ec... "Lowering the cost of doing business" |
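Chris's observation about `if (OPENSTEPS==0)` being dead code can be reproduced in isolation: when a symbol is a `#define` for a nonzero constant, the preprocessor pastes that literal into the comparison, the branch is decided at compile time, and the close path is never reached. The value 4 below is made up for this sketch — only "nonzero" matters, as in mfs-1.6.20.

```c
#include <assert.h>

/* OPENSTEPS stands in for the compile-time constant from hddspacemgr.c;
 * the value 4 is arbitrary for this sketch. */
#define OPENSTEPS 4

/* Returns 1 if the descriptor-closing branch would run, 0 otherwise. */
static int close_branch_reachable(void)
{
    if (OPENSTEPS == 0) {
        return 1;   /* dead code: the macro expands to a nonzero literal */
    }
    return 0;
}
```

A compiler will typically discard the `return 1;` arm entirely, which is why the descriptors are only ever closed by the separate hdd_delayed_ops() path.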
From: Steve T. <sm...@cb...> - 2012-03-01 12:35:23
|
On Tue, 28 Feb 2012, Ricardo J. Barberis wrote: > This happened to me once and I also took down every server of the cluster, 9 > chunkservers and one dedicated metalogger (previously, I unmounted all the > clients, about 250). > > Bad idea: when the master came on-line again and I started one chunkserver, > the master went "crazy" trying to recreate empty chunks for later deletion. > > My "solution" was to start all the chunkservers at the same time, so the > master saw all the chunks almost simultaneously and didn't try to create > empty chunks. It turns out that the master was very very slow because a RAID-5 reconstruction was in progress on the box. The I/O performance dropped to 5% of its normal value, in spite of the RAID controller throttle being set to 30% maximum reconstruction rate (it's a Dell PE2900 server with a Perc 5 controller). Once the reconstruction finished, I restarted everything (all chunk servers at the same time) and it came up fine within a few seconds. Steve -- ---------------------------------------------------------------------------- Steve Thompson, Cornell School of Chemical and Biomolecular Engineering smt AT cbe DOT cornell DOT edu "186,282 miles per second: it's not just a good idea, it's the law" ---------------------------------------------------------------------------- |
From: Michał B. <mic...@ge...> - 2012-03-01 12:17:28
|
Hi Davies and Chris! Your remarks are very interesting. Please wait with your patches - we’ll take these into account in one of the future versions. Regards Michal From: Davies Liu [mailto:dav...@gm...] Sent: Thursday, March 01, 2012 9:52 AM To: Chris Picton Cc: Michał Borychowski; moo...@li... Subject: Re: [Moosefs-users] Bad write performance of mfschunkserver I can not figure it out how to make fsync frequency configurable, then move fsync() just before close(): --- mfs-1.6.26/mfschunkserver/hddspacemgr.c 2012-02-08 16:15:03.000000000 +0800 +++ mfs-1.6.26-r1/mfschunkserver/hddspacemgr.c 2012-03-01 16:17:23.000000000 +0800 @@ -1887,28 +1887,28 @@ errno = errmem; return status; } - ts = get_usectime(); -#ifdef F_FULLFSYNC - if (fcntl(c->fd,F_FULLFSYNC)<0) { - int errmem = errno; - mfs_arg_errlog_silent(LOG_WARNING,"hdd_io_end: file:%s - fsync (via fcntl) error",c->filename); - errno = errmem; - return ERROR_IO; - } -#else - if (fsync(c->fd)<0) { - int errmem = errno; - mfs_arg_errlog_silent(LOG_WARNING,"hdd_io_end: file:%s - fsync (direct call) error",c->filename); - errno = errmem; - return ERROR_IO; - } -#endif - te = get_usectime(); - hdd_stats_datafsync(c->owner,te-ts); } c->crcrefcount--; if (c->crcrefcount==0) { if (OPENSTEPS==0) { + ts = get_usectime(); +#ifdef F_FULLFSYNC + if (fcntl(c->fd,F_FULLFSYNC)<0) { + int errmem = errno; + mfs_arg_errlog_silent(LOG_WARNING,"hdd_io_end: file:%s - fsync (via fcntl) error",c->filename); + errno = errmem; + return ERROR_IO; + } +#else + if (fsync(c->fd)<0) { + int errmem = errno; + mfs_arg_errlog_silent(LOG_WARNING,"hdd_io_end: file:%s - fsync (direct call) error",c->filename); + errno = errmem; + return ERROR_IO; + } +#endif + te = get_usectime(); + hdd_stats_datafsync(c->owner,te-ts); if (close(c->fd)<0) { int errmem = errno; c->fd = -1; On Thu, Mar 1, 2012 at 12:21 PM, Chris Picton <ch...@ec...> wrote: I have had similar ideas I currently have a patch to disable fsync on every block close, however this will probably 
lead to data corruption if there is a site-wide power outage. My thoughts are as follows: · Create config variable FLUSH_ON_WRITE (0/1) to disable or enable flush on write · Create config variable FLUSH_DELAY (seconds) to prevent flushing immediately - rather the pointers to the written chunks would be stored, and looped through (in a separate thread?), to flush any which are older than the delay. This would ensure that the chunks have a maximum time during which they may potentially be invalid on disk. If the FLUSH_DELAY is 0, then behaviour is as current · Create config variable CHECKSUM_INITIAL (0/1) If set to 1 would force a checksum of *all* blocks on a chunkserver at startup, to find potentially bad chunks before they are used. Is this necessary, though? Are checksums read on every block read? I may start making the above patches, if I get time to do so. Chris On 2012/03/01 6:09 AM, Davies Liu wrote: Hi Michal! I have found the reason for bad write performance, the chunk server will write the crc block into disk EVERY second, then fsync(), which will take about 28ms. Can I reduce frequency of fsync() to every minute, then check the chunk modified in recent 30 minutes after booting? Davies 2012/2/23 Michał Borychowski <mic...@ge...> Hi Davies! Here is our analysis of this situation. Different files are written simultaneously on the same CS - that's why pwrites are written do different files. Block size of 64kB is not that small. Writes in OS are sent through write cache so all saves being multiplication of 4096B should work equally fast. 
Our tests: dd on Linux 64k : 640k $ dd if=/dev/zero of=/tmp/test bs=64k count=10000 10000+0 records in 10000+0 records out 655360000 bytes (655 MB) copied, 22.1836 s, 29.5 MB/s $ dd if=/dev/zero of=/tmp/test bs=640k count=1000 1000+0 records in 1000+0 records out 655360000 bytes (655 MB) copied, 23.1311 s, 28.3 MB/s dd on Mac OS X 64k : 640k $ dd if=/dev/zero of=/tmp/test bs=64k count=10000 10000+0 records in 10000+0 records out 655360000 bytes transferred in 14.874652 secs (44058846 bytes/sec) $ dd if=/dev/zero of=/tmp/test bs=640k count=1000 1000+0 records in 1000+0 records out 655360000 bytes transferred in 14.578427 secs (44954096 bytes/sec) So the times are similar. Saves going to different files also should not be a problem as a kernel scheduler takes care of this. If you have some specific idea how to improve the saves please share it with us. Kind regards Michał -----Original Message----- From: Davies Liu [mailto:dav...@gm...] Sent: Wednesday, February 22, 2012 8:24 AM To: moo...@li... Subject: [Moosefs-users] Bad write performance of mfschunkserver Hi,devs: Today, We found that some mfschunkserver were not responsive, caused many timeout in mfsmount, then all the write operation were blocked. 
After some digging, we found some small but continuous write bandwidth; strace showed many small pwrite() calls spread across several files:

[pid 7087] 12:28:28 pwrite(19, "baggins3 60.210.18.235 sE7NtNQU7"..., 25995, 55684725 <unfinished ...>
[pid 7078] 12:28:28 pwrite(17, "2012/02/22 12:28:28:root: WARNIN"..., 69, 21768909 <unfinished ...>
[pid 7080] 12:28:28 pwrite(20, "gardner4 183.7.50.169 mr5vi+Z4H3"..., 47663, 34550257 <unfinished ...>
[pid 7079] 12:28:28 pwrite(19, "\" \"Mozilla/5.0 (Windows NT 6.1) "..., 40377, 55710720 <unfinished ...>
[pid 7086] 12:28:28 pwrite(23, "MATP; InfoPath.2; .NET4.0C; 360S"..., 65536, 6427648 <unfinished ...>
[pid 7082] 12:28:28 pwrite(23, "; GTB7.2; SLCC2; .NET CLR 2.0.50"..., 65536, 6493184 <unfinished ...>
[pid 7083] 12:28:28 pwrite(20, "\255BYU\355\237\347\226s\261\307N{A\355\203S\306\244\255\322[\322\rJ\32[z3\31\311\327"..., 4096, 1024 <unfinished ...>
[pid 7078] 12:28:28 pwrite(23, "ovie/subject/4724373/reviews?sta"..., 65536, 6558720 <unfinished ...>
[pid 7080] 12:28:28 pwrite(19, "[\"[\4\5\266v\324\366\245n\t\315\202\227\\\343=\336-\r k)\316\354\335\353\373\340\331;"..., 4096, 1024 <unfinished ...>
[pid 7079] 12:28:28 pwrite(23, "ta-Python/2.0.15\" 0.016\n211.147."..., 65536, 6624256 <unfinished ...>
[pid 7081] 12:28:28 pwrite(23, "4034093?apikey=0eb695f25995d7eb2"..., 65536, 6689792 <unfinished ...>
[pid 7084] 12:28:28 pwrite(23, " y8G23n95BKY:43534427:wind8vssc4"..., 65536, 6755328) = 65536 <0.000108>
[pid 7078] 12:28:28 pwrite(23, "TkVvKuXfug:3248233:5Yo9vFoOIuo \""..., 65536, 6820864 <unfinished ...>
[pid 7086] 12:28:28 pwrite(23, ":s|1563396:s|1040897:s|1395290:s"..., 65536, 6886400 <unfinished ...>
[pid 7085] 12:28:28 pwrite(23, "dows%3B%20U%3B%20Windows%20NT%20"..., 65536, 6951936 <unfinished ...>
[pid 7087] 12:28:28 pwrite(23, "/533.17.9 (KHTML, like Gecko) Ve"..., 65536, 7017472 <unfinished ...>
[pid 7079] 12:28:28 pwrite(23, " r1m+tFW1T5M:: \"22/Feb/2012:00:0"..., 65536, 7083008 <unfinished ...>
[pid 7086] 12:28:28 pwrite(19, "baggins5 61.174.60.117 i6MSCBvE1"..., 25159, 55751097 <unfinished ...>
[pid 7084] 12:28:28 pwrite(20, "gardner1 182.118.7.64 TjxzPKdqNU"..., 10208, 34597920 <unfinished ...>
[pid 7080] 12:28:28 pwrite(23, "d7eb2c23c1d70cc187c1&alt=json HT"..., 65536, 7148544 <unfinished ...>
[pid 7083] 12:28:28 pwrite(23, "5_Google&type=n&channel=-3&user_"..., 65536, 7214080 <unfinished ...>
[pid 7085] 12:28:28 pwrite(19, "12-02-22 12:28:27 1861 \"GET /ser"..., 23179, 55776256 <unfinished ...>
[pid 7082] 12:28:28 pwrite(23, "\"http://douban.fm/swf/53035/fmpl"..., 65536, 7279616 <unfinished ...>
[pid 7078] 12:28:28 pwrite(20, "opic/27639291/add_comment HTTP/1"..., 18576, 34608128 <unfinished ...>
[pid 7087] 12:28:28 pwrite(19, "[\"[\4\5\266v\324\366\245n\t\315\202\227\\\343=\336-\r k)\316\354\335\353\373\340\331;"..., 4096, 1024 <unfinished ...>
[pid 7079] 12:28:28 pwrite(23, "ww.douban.com%2Fgroup%2Ftopic%2F"..., 65536, 7345152 <unfinished ...>
[pid 7081] 12:28:28 pwrite(20, "\255BYU\355\237\347\226s\261\307N{A\355\203S\306\244\255\322[\322\rJ\32[z3\31\311\327"..., 4096, 1024 <unfinished ...>
[pid 7086] 12:28:28 pwrite(23, "patible; MSIE 7.0; Windows NT 6."..., 65536, 7410688 <unfinished ...>
[pid 7084] 12:28:28 pwrite(23, "fari/535.7 360EE\" 0.006\n211.147."..., 65536, 7476224 <unfinished ...>
[pid 7080] 12:28:28 pwrite(23, "1:OUIVR8CIG5c \"22/Feb/2012:00:03"..., 65536, 7541760 <unfinished ...>
[pid 7085] 12:28:28 pwrite(23, "fm \"GET /j/mine/playlist?type=s&"..., 65536, 7607296 <unfinished ...>
[pid 7083] 12:28:28 pwrite(23, "pe=n&channel=18&user_id=39266798"..., 65536, 7672832 <unfinished ...>
[pid 7082] 12:28:28 pwrite(23, " 0.023\n125.34.190.128 :: \"22/Feb"..., 65536, 7738368 <unfinished ...>
[pid 7078] 12:28:28 pwrite(23, "00 5859 \"http://www.douban.com/p"..., 65536, 7803904 <unfinished ...>
[pid 7079] 12:28:28 pwrite(23, "03:08 +0800\" www.douban.com \"GET"..., 65536, 7869440 <unfinished ...>
[pid 7086] 12:28:28 pwrite(23, "type=all HTTP/1.1\" 200 1492 \"-\" "..., 65536, 7934976 <unfinished ...>
[pid 7084] 12:28:28 pwrite(23, "Hiapk&user_id=57982902&expire=13"..., 65536, 8000512 <unfinished ...>
[pid 7080] 12:28:28 pwrite(23, "0.011\n116.253.89.216 rxASuWZf1wg"..., 65536, 8066048 <unfinished ...>
[pid 7085] 12:28:28 pwrite(23, "9 +0800\" www.douban.com \"GET /ph"..., 65536, 8131584) = 65536 <0.000062>
[pid 7083] 12:28:28 pwrite(23, " +0800\" www.douban.com \"GET /eve"..., 65536, 8197120 <unfinished ...>
[pid 7082] 12:28:28 pwrite(23, " +0800\" www.douban.com \"POST /se"..., 65536, 8262656) = 65536 <0.000103>
[pid 7087] 12:28:28 pwrite(23, "0 12971 \"http://www.douban.com/g"..., 65536, 8328192 <unfinished ...>
[pid 7081] 12:28:28 pwrite(23, ".0 (compatible; MSIE 7.0; Window"..., 65536, 8393728) = 65536 <0.000065>

In order to get better performance, the chunk server should merge continuous sequential write operations into larger ones.

--
- Davies

------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing also
focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
_______________________________________________
moosefs-users mailing list
moo...@li...
https://lists.sourceforge.net/lists/listinfo/moosefs-users

--
- Davies |
From: Davies L. <dav...@gm...> - 2012-03-01 08:52:00
|
I could not figure out how to make the fsync frequency configurable, so I moved the fsync() to just before close():

--- mfs-1.6.26/mfschunkserver/hddspacemgr.c	2012-02-08 16:15:03.000000000 +0800
+++ mfs-1.6.26-r1/mfschunkserver/hddspacemgr.c	2012-03-01 16:17:23.000000000 +0800
@@ -1887,28 +1887,28 @@
 		errno = errmem;
 		return status;
 	}
-	ts = get_usectime();
-#ifdef F_FULLFSYNC
-	if (fcntl(c->fd,F_FULLFSYNC)<0) {
-		int errmem = errno;
-		mfs_arg_errlog_silent(LOG_WARNING,"hdd_io_end: file:%s - fsync (via fcntl) error",c->filename);
-		errno = errmem;
-		return ERROR_IO;
-	}
-#else
-	if (fsync(c->fd)<0) {
-		int errmem = errno;
-		mfs_arg_errlog_silent(LOG_WARNING,"hdd_io_end: file:%s - fsync (direct call) error",c->filename);
-		errno = errmem;
-		return ERROR_IO;
-	}
-#endif
-	te = get_usectime();
-	hdd_stats_datafsync(c->owner,te-ts);
 }
 c->crcrefcount--;
 if (c->crcrefcount==0) {
 	if (OPENSTEPS==0) {
+		ts = get_usectime();
+#ifdef F_FULLFSYNC
+		if (fcntl(c->fd,F_FULLFSYNC)<0) {
+			int errmem = errno;
+			mfs_arg_errlog_silent(LOG_WARNING,"hdd_io_end: file:%s - fsync (via fcntl) error",c->filename);
+			errno = errmem;
+			return ERROR_IO;
+		}
+#else
+		if (fsync(c->fd)<0) {
+			int errmem = errno;
+			mfs_arg_errlog_silent(LOG_WARNING,"hdd_io_end: file:%s - fsync (direct call) error",c->filename);
+			errno = errmem;
+			return ERROR_IO;
+		}
+#endif
+		te = get_usectime();
+		hdd_stats_datafsync(c->owner,te-ts);
 		if (close(c->fd)<0) {
 			int errmem = errno;
 			c->fd = -1;

On Thu, Mar 1, 2012 at 12:21 PM, Chris Picton <ch...@ec...> wrote:
> I have had similar ideas
>
> I currently have a patch to disable fsync on every block close, however
> this will probably lead to data corruption if there is a site-wide power
> outage.
>
> My thoughts are as follows:
>
>   - Create config variable FLUSH_ON_WRITE (0/1) to disable or enable
>     flush on write
>   - Create config variable FLUSH_DELAY (seconds) to prevent flushing
>     immediately - rather the pointers to the written chunks would be stored,
>     and looped through (in a separate thread?), to flush any which are older
>     than the delay. This would ensure that the chunks have a maximum time
>     during which they may potentially be invalid on disk. If the FLUSH_DELAY
>     is 0, then behaviour is as current
>   - Create config variable CHECKSUM_INITIAL (0/1) If set to 1 would
>     force a checksum of *all* blocks on a chunkserver at startup, to find
>     potentially bad chunks before they are used. Is this necessary, though?
>     Are checksums read on every block read?
>
> I may start making the above patches, if I get time to do so.
>
> Chris
>
> On 2012/03/01 6:09 AM, Davies Liu wrote:
>
> Hi Michal!
>
> I have found the reason for bad write performance, the chunk server will
> write the crc block into disk EVERY second, then fsync(), which will take
> about 28ms.
>
> Can I reduce frequency of fsync() to every minute, then check the chunk
> modified in recent 30 minutes after booting?
>
> Davies
>
> 2012/2/23 Michał Borychowski <mic...@ge...>
>
>> Hi Davies!
>>
>> Here is our analysis of this situation. Different files are written
>> simultaneously on the same CS - that's why pwrites are written to different
>> files. Block size of 64kB is not that small. Writes in OS are sent through
>> write cache so all saves being a multiple of 4096B should work equally
>> fast.
>>
>> Our tests:
>>
>> dd on Linux 64k : 640k
>> $ dd if=/dev/zero of=/tmp/test bs=64k count=10000
>> 10000+0 records in
>> 10000+0 records out
>> 655360000 bytes (655 MB) copied, 22.1836 s, 29.5 MB/s
>>
>> $ dd if=/dev/zero of=/tmp/test bs=640k count=1000
>> 1000+0 records in
>> 1000+0 records out
>> 655360000 bytes (655 MB) copied, 23.1311 s, 28.3 MB/s
>>
>> dd on Mac OS X 64k : 640k
>> $ dd if=/dev/zero of=/tmp/test bs=64k count=10000
>> 10000+0 records in
>> 10000+0 records out
>> 655360000 bytes transferred in 14.874652 secs (44058846 bytes/sec)
>>
>> $ dd if=/dev/zero of=/tmp/test bs=640k count=1000
>> 1000+0 records in
>> 1000+0 records out
>> 655360000 bytes transferred in 14.578427 secs (44954096 bytes/sec)
>>
>> So the times are similar. Saves going to different files also should not
>> be a problem as the kernel scheduler takes care of this.
>>
>> If you have some specific idea how to improve the saves please share it
>> with us.
>>
>> Kind regards
>> Michał
>>
>> -----Original Message-----
>> From: Davies Liu [mailto:dav...@gm...]
>> Sent: Wednesday, February 22, 2012 8:24 AM
>> To: moo...@li...
>> Subject: [Moosefs-users] Bad write performance of mfschunkserver
>>
>> Hi,devs:
>>
>> Today, we found that some mfschunkservers were not responsive, which
>> caused many timeouts in mfsmount; then all write operations were blocked.
>> After some digging, we found that there were some small but continuous
>> write bandwidth, strace show that many small pwrite() between several
>> files:
>>
>> [strace output snipped]
>>
>> In order to get better performance, the chunk server should merge the
>> continuous sequential write operations into larger ones.
>>
>> --
>> - Davies

--
- Davies |
From: Serhan S. <se...@gm...> - 2012-03-01 08:38:15
|
From: Chris P. <ch...@ec...> - 2012-03-01 04:46:44
|
I have had similar ideas

I currently have a patch to disable fsync on every block close, however
this will probably lead to data corruption if there is a site-wide power
outage.

My thoughts are as follows:

 * Create config variable FLUSH_ON_WRITE (0/1) to disable or enable
   flush on write
 * Create config variable FLUSH_DELAY (seconds) to prevent flushing
   immediately - rather the pointers to the written chunks would be stored,
   and looped through (in a separate thread?), to flush any which are older
   than the delay. This would ensure that the chunks have a maximum time
   during which they may potentially be invalid on disk. If the FLUSH_DELAY
   is 0, then behaviour is as current
 * Create config variable CHECKSUM_INITIAL (0/1) If set to 1 would
   force a checksum of *all* blocks on a chunkserver at startup, to find
   potentially bad chunks before they are used. Is this necessary, though?
   Are checksums read on every block read?

I may start making the above patches, if I get time to do so.

Chris

On 2012/03/01 6:09 AM, Davies Liu wrote:
> Hi Michal!
>
> I have found the reason for bad write performance, the chunk server
> will write the crc block into disk EVERY second, then fsync(), which
> will take about 28ms.
>
> Can I reduce frequency of fsync() to every minute, then check the
> chunk modified in recent 30 minutes after booting?
>
> Davies
>
> 2012/2/23 Michał Borychowski <mic...@ge...>
>
> Hi Davies!
>
> Here is our analysis of this situation. Different files are
> written simultaneously on the same CS - that's why pwrites are
> written to different files. Block size of 64kB is not that small.
> Writes in OS are sent through write cache so all saves being
> a multiple of 4096B should work equally fast.
>
> Our tests:
>
> dd on Linux 64k : 640k
> $ dd if=/dev/zero of=/tmp/test bs=64k count=10000
> 10000+0 records in
> 10000+0 records out
> 655360000 bytes (655 MB) copied, 22.1836 s, 29.5 MB/s
>
> $ dd if=/dev/zero of=/tmp/test bs=640k count=1000
> 1000+0 records in
> 1000+0 records out
> 655360000 bytes (655 MB) copied, 23.1311 s, 28.3 MB/s
>
> dd on Mac OS X 64k : 640k
> $ dd if=/dev/zero of=/tmp/test bs=64k count=10000
> 10000+0 records in
> 10000+0 records out
> 655360000 bytes transferred in 14.874652 secs (44058846 bytes/sec)
>
> $ dd if=/dev/zero of=/tmp/test bs=640k count=1000
> 1000+0 records in
> 1000+0 records out
> 655360000 bytes transferred in 14.578427 secs (44954096 bytes/sec)
>
> So the times are similar. Saves going to different files also
> should not be a problem as the kernel scheduler takes care of this.
>
> If you have some specific idea how to improve the saves please
> share it with us.
>
> Kind regards
> Michał
>
> -----Original Message-----
> From: Davies Liu [mailto:dav...@gm...]
> Sent: Wednesday, February 22, 2012 8:24 AM
> To: moo...@li...
> Subject: [Moosefs-users] Bad write performance of mfschunkserver
>
> Hi,devs:
>
> Today, we found that some mfschunkservers were not responsive,
> which caused many timeouts in mfsmount; then all write operations
> were blocked.
>
> After some digging, we found that there were some small but
> continuous write bandwidth, strace show that many small pwrite()
> between several files:
>
> [strace output snipped]
>
> In order to get better performance, the chunk server should merge
> the continuous sequential write operations into larger ones.
>
> --
> - Davies |
From: Davies L. <dav...@gm...> - 2012-03-01 04:09:32
|
Hi Michal!

I have found the reason for the bad write performance: the chunk server
writes the crc block to disk EVERY second and then calls fsync(), which
takes about 28ms.

Can I reduce the frequency of fsync() to once per minute, and then after
booting check the chunks modified in the last 30 minutes?

Davies

2012/2/23 Michał Borychowski <mic...@ge...>

> Hi Davies!
>
> Here is our analysis of this situation. Different files are written
> simultaneously on the same CS - that's why pwrites are written to different
> files. Block size of 64kB is not that small. Writes in OS are sent through
> write cache so all saves being a multiple of 4096B should work equally
> fast.
>
> Our tests:
>
> dd on Linux 64k : 640k
> $ dd if=/dev/zero of=/tmp/test bs=64k count=10000
> 10000+0 records in
> 10000+0 records out
> 655360000 bytes (655 MB) copied, 22.1836 s, 29.5 MB/s
>
> $ dd if=/dev/zero of=/tmp/test bs=640k count=1000
> 1000+0 records in
> 1000+0 records out
> 655360000 bytes (655 MB) copied, 23.1311 s, 28.3 MB/s
>
> dd on Mac OS X 64k : 640k
> $ dd if=/dev/zero of=/tmp/test bs=64k count=10000
> 10000+0 records in
> 10000+0 records out
> 655360000 bytes transferred in 14.874652 secs (44058846 bytes/sec)
>
> $ dd if=/dev/zero of=/tmp/test bs=640k count=1000
> 1000+0 records in
> 1000+0 records out
> 655360000 bytes transferred in 14.578427 secs (44954096 bytes/sec)
>
> So the times are similar. Saves going to different files also should not
> be a problem as the kernel scheduler takes care of this.
>
> If you have some specific idea how to improve the saves please share it
> with us.
>
> Kind regards
> Michał
>
> -----Original Message-----
> From: Davies Liu [mailto:dav...@gm...]
> Sent: Wednesday, February 22, 2012 8:24 AM
> To: moo...@li...
> Subject: [Moosefs-users] Bad write performance of mfschunkserver
>
> Hi,devs:
>
> Today, we found that some mfschunkservers were not responsive, which
> caused many timeouts in mfsmount; then all write operations were blocked.
>
> After some digging, we found that there were some small but continuous
> write bandwidth, strace show that many small pwrite() between several
> files:
>
> [strace output snipped]
>
> In order to get better performance, the chunk server should merge the
> continuous sequential write operations into larger ones.
>
> --
> - Davies

--
- Davies |
From: lingjie <li...@gy...> - 2012-03-01 01:28:27
|
I think it is necessary to do so.

> On 12-02-29 3:59 AM, lingjie wrote:
>> Hi, all
>>
>> If a mfs client also is a mfs chunk, when it read from or write to the
>> mfs, how master will be to choice ? Random ? Or select the client self ?
>>
>> Regards.
>>
> Yes, currently the selection is based on where ever there are chunks are
> available. the system does not have location aware abilities yet..

--
Seven Ling
Linux engineer
BingJing Guang Yu-Online Technology co.,Ltd
address: YingChuangDongLi A502 ShangDi BeiJing China
Mob: 15110179720
QQ: 85376415
MSN: sev...@ho... com
website: www.gyyx.cn |
From: Travis H. <tra...@tr...> - 2012-02-29 19:25:35
|
On 12-02-29 3:59 AM, lingjie wrote:
> Hi, all
>
> If a mfs client also is a mfs chunk, when it read from or write to the
> mfs, how master will be to choice ? Random ? Or select the client self ?
>
> Regards.
>
Yes - currently the selection is based on wherever chunks are available;
the system does not have location-aware abilities yet. |
From: jose m. <let...@us...> - 2012-02-29 17:19:19
|
On Mon, 2012-02-27 at 21:50 +0100, Michał Borychowski wrote:
> Hi Jose!
>
> But we cannot see any "warnings" in the attached files...? Where exactly
> is the problem?
>
> Regards
> Michal
>

* Oops..., sorry:

gcc -DHAVE_CONFIG_H -I. -I.. -I../mfsmaster -I../mfscommon -DAPPNAME=mfsmetarestore -DMETARESTORE -D_GNU_SOURCE -std=c99 -g -O2 -W -Wall -Wshadow -pedantic -MT filesystem.o -MD -MP -MF .deps/filesystem.Tpo -c -o filesystem.o `test -f '../mfsmaster/filesystem.c' || echo './'`../mfsmaster/filesystem.c
../mfsmaster/filesystem.c: In function ‘fs_storeedge’:
../mfsmaster/filesystem.c:6000:9: warning: variable ‘happy’ set but not used [-Wunused-but-set-variable]
../mfsmaster/filesystem.c: In function ‘fs_storenode’:
../mfsmaster/filesystem.c:6166:9: warning: variable ‘happy’ set but not used [-Wunused-but-set-variable]
../mfsmaster/filesystem.c: In function ‘fs_storefree’:
../mfsmaster/filesystem.c:6848:9: warning: variable ‘happy’ set but not used [-Wunused-but-set-variable]
../mfsmaster/filesystem.c: In function ‘fs_store’:
../mfsmaster/filesystem.c:6917:9: warning: variable ‘happy’ set but not used [-Wunused-but-set-variable]
../mfsmaster/filesystem.c: In function ‘fs_emergency_storeall’:
../mfsmaster/filesystem.c:7116:9: warning: variable ‘happy’ set but not used [-Wunused-but-set-variable]
../mfsmaster/filesystem.c: In function ‘fs_storeall’:
../mfsmaster/filesystem.c:7274:9: warning: variable ‘happy’ set but not used [-Wunused-but-set-variable]
.....
....

* As a result I had a problem with cgiserv. |
From: Florent B. <fl...@co...> - 2012-02-29 15:08:09
|
Of course a shared FS in Proxmox MUST be mounted on every host! MFS too.

On 02/29/2012 02:46 PM, Steve wrote:
> I have a Proxmox master cluster node with VM and ISO pool storage mounted on
> MFS; it runs fine. They don't show on a Proxmox cluster slave. Do I need the
> MFS mount on both machines? I have tried the shared option in the storage
> setup; it doesn't help. The PVE storage config on the slave looks OK.
>
> Googled Proxmox with no joy, so just trying here.
>
> Steve |
From: Steve <st...@bo...> - 2012-02-29 13:45:46
|
I have a Proxmox master cluster node with VM and ISO pool storage mounted on MFS; it runs fine. They don't show on a Proxmox cluster slave. Do I need the MFS mount on both machines? I have tried the shared option in the storage setup; it doesn't help. The PVE storage config on the slave looks OK.

Googled Proxmox with no joy, so just trying here.

Steve |
From: lingjie <li...@gy...> - 2012-02-29 09:16:44
|
Hi, all

If an MFS client is also an MFS chunkserver, how will the master choose a chunkserver when the client reads from or writes to MFS? At random, or will it prefer the client itself?

Regards.
--
Seven Ling
Linux engineer
BingJing Guang Yu-Online Technology co.,Ltd |
From: Ricardo J. B. <ric...@da...> - 2012-02-28 23:41:58
|
Sorry, I didn't read your email until now.

You might be hitting your mfsmaster too hard, as it tries to recreate nonexistent chunks for later deletion. If that's the case, keep reading; but first of all, back up your metadata files (usually located in /var/mfs or /var/lib/mfs).

On Tuesday 28/02/2012, Steve Thompson wrote:
> MFS 1.6.20, CentOS 5.7 64-bit. Four chunk servers.
>
> I have an MFS setup that was working without any issues, until yesterday
> when the mfsmaster hung. I shut down the master and the chunkservers. The
> master starts properly and everything looks good until I start one of the
> chunkservers (any one), at which point the master hangs.

This happened to me once, and I also took down every server of the cluster: 9 chunkservers and one dedicated metalogger (previously, I unmounted all the clients, about 250). Bad idea: when the master came online again and I started one chunkserver, the master went "crazy" trying to recreate empty chunks for later deletion.

My "solution" was to start all the chunkservers at the same time, so the master saw all the chunks almost simultaneously and didn't try to create empty chunks. After a while (I can't remember exactly, but it was half an hour to an hour) the master and chunkservers calmed down and the cluster was functioning properly again.

Fortunately for me, it was our MFS for backups, so a few hours off-line or data loss wasn't a real problem, but you should probably be careful and make a backup of your mfsmaster files first.

> I would appreciate a clue as to where to look, as I have not changed
> anything from the working configuration, and I can't find anything obviously
> wrong. Of course this is an emergency.
>
> Steve

Hope it helps,
--
Ricardo J. Barberis
Senior SysAdmin / ITI
Dattatec.com :: Soluciones de Web Hosting
Tu Hosting hecho Simple!
------------------------------------------ |
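Before touching a hung or crashed master at all, it is worth snapshotting the whole data directory, as Ricardo suggests. A minimal sketch; the data directory below is simulated with mktemp and dummy files, and on a real master you would point MFS_DIR at the DATA_PATH from mfsmaster.cfg instead:

```shell
# Simulated master data directory; on a real master set
# MFS_DIR to the DATA_PATH from mfsmaster.cfg (often /var/lib/mfs or /var/mfs)
MFS_DIR=$(mktemp -d)
touch "$MFS_DIR/metadata.mfs.back" "$MFS_DIR/sessions.mfs" "$MFS_DIR/changelog.0.mfs"

# Snapshot everything before running mfsmetarestore or restarting the master
BACKUP="/tmp/mfs-meta-backup-$(date +%Y%m%d%H%M%S).tar.gz"
tar -czf "$BACKUP" -C "$MFS_DIR" .
tar -tzf "$BACKUP"
echo "metadata backed up to $BACKUP"
```

With a copy of metadata.mfs.back and the changelogs safely tarred away, an mfsmetarestore attempt that goes wrong can always be retried from the snapshot.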
From: Ricardo J. B. <ric...@da...> - 2012-02-28 22:56:26
|
On Saturday 25/02/2012, Corin Langosch wrote:
> Hola Ricardo,

Hallo!

> I just managed to fix the issue: the problem was I didn't use any spaces
> between the key and the value, so "AAA=BBB" is silently ignored while
> "AAA = BBB" works fine.

Good catch! I was just wondering if it would be possible to use "KEY=VALUE" (without spaces) so I could source the config files from bash and use KEY as a shell variable. I guess I'll have to find another way.

> I'd suggest at least emitting a warning on the console/syslog when an
> invalid key is found. If the code were already hosted on github I'd even
> submit a patch/pull request ;-)
>
> Thanks and regards,
> Corin

Thank you,
--
Ricardo J. Barberis
Senior SysAdmin / ITI
Dattatec.com :: Soluciones de Web Hosting
Tu Hosting hecho Simple!
------------------------------------------ |
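For what it's worth, the spacing requirement is also exactly why bash can't source these config files directly: the shell rejects `KEY = VALUE`. One workaround for Ricardo's use case is to normalize the spacing before evaluating. A sketch using a made-up config fragment (the file name and values are examples, not a real config):

```shell
# Hypothetical config fragment in the mfs "KEY = VALUE" style
cat > /tmp/demo_mfsmaster.cfg <<'EOF'
DATA_PATH = /var/lib/mfs
MATOML_LISTEN_PORT = 9419
EOF

# bash can't source "KEY = VALUE" directly, so strip the spaces around '='
# and only accept lines that look like UPPER_CASE keys (skips comments)
eval "$(sed -n 's/^\([A-Z_][A-Z_0-9]*\)[[:space:]]*=[[:space:]]*\(.*\)$/\1="\2"/p' /tmp/demo_mfsmaster.cfg)"

echo "DATA_PATH=$DATA_PATH"
echo "MATOML_LISTEN_PORT=$MATOML_LISTEN_PORT"
```

The `eval` approach keeps the config file itself in the format mfsmaster expects, so no second copy needs to be maintained.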
From: Ricardo J. B. <ric...@da...> - 2012-02-28 22:49:32
|
On Saturday 25/02/2012, Stas Oskin wrote:
> Moreover, I noticed that running mount twice causes a double fuse mount,
> with unknown effects on system behavior.
> Meaning the MFS client is unable to detect that it is already mounted, and
> just mounts again.
>
> Will the additional mounting in rc.local cause such an effect?

Yes, you're right. I should have mentioned it but it slipped my mind: mfsmount has an option (-o nonempty: allow mounts over non-empty file/dir) that I believe is off by default, which instructs fuse to mount anyway, causing double mounts.

On CentOS 5 and CentOS 6, with mfs 1.6.20 installed from repoforge.org, it doesn't double-mount:

[root@mds01 ~]# cat /etc/redhat-release
CentOS release 6.2 (Final)
[root@mds01 ~]# grep ^mfs /etc/fstab
mfsmount /mnt/mfs fuse defaults,noatime,_netdev,mfsmaster=mds01,mfssubfolder=/ 0 0
[root@mds01 ~]# mount -a -t fuse
mfsmaster accepted connection with parameters: read-write,restricted_ip ; root mapped to root:root
fuse: mountpoint is not empty
fuse: if you are sure this is safe, use the 'nonempty' mount option
error in fuse_mount

[root@bkpmds01 ~]# cat /etc/redhat-release
CentOS release 5.7 (Final)
[root@bkpmds01 ~]# grep ^mfs /etc/fstab
mfsmount /mnt/mfs fuse defaults,noatime,_netdev,mfsmaster=bkpmds01,mfssubfolder=/ 0 0
[root@bkpmds01 ~]# mount -a -t fuse
mfsmaster accepted connection with parameters: read-write,restricted_ip ; root mapped to root:root
fuse: mountpoint is not empty
fuse: if you are sure this is safe, use the 'nonempty' mount option
error in fuse_mount

> Regards.
>
> On Sat, Feb 25, 2012 at 10:56 AM, Stas Oskin <sta...@gm...> wrote:
> > Hi.
> >
> > So you advise both having mount -a -t fuse in rc.local, and chkconfig
> > netfs, for Linux systems?
> >
> > Regards.
> >
> > On Fri, Feb 24, 2012 at 1:18 AM, Ricardo J. Barberis <
> > ric...@da...> wrote:
> >> Also, beware of gigabit networks: some NICs are very slow at
> >> establishing link, and mfsmount (or any other network filesystem) will
> >> fail mounting anyway, even using _netdev in fstab as suggested by Michał.
> >>
> >> I have 'mount -a -t fuse' in /etc/rc.local just because of this.
> >>
> >> And, if you use _netdev you also have to enable mounting network
> >> filesystems on boot: 'chkconfig netfs on' in RedHat and derivatives
> >> should do the trick.
> >>
> >> PS: Michał, could you add that last bit about netfs to the Reference
> >> Guide? Thanks!
> >>
> >> On Tuesday 21/02/2012, Michał Borychowski wrote:
> >> > Hi Stas!
> >> >
> >> > What platform do you use? The solution with /etc/fstab works only on
> >> > Linux platforms (tested on Debian). On other platforms you need to
> >> > prepare a script in /usr/local/etc/rc.d which will run mfsmount with
> >> > the needed options.
> >> >
> >> > And on Linux you need the _netdev option, as written on
> >> > http://www.moosefs.org/reference-guide.html in the "Mounting the File
> >> > System" section.
> >> >
> >> > Kind regards
> >> > Michal
> >> >
> >> > From: Stas Oskin [mailto:sta...@gm...]
> >> > Sent: Sunday, January 15, 2012 10:01 AM
> >> > To: MooseFS
> >> > Subject: Connection timed out at mfsmount on boot
> >> >
> >> > Hi.
> >> >
> >> > We have the mfsmount set in fstab, but noticed it never properly works
> >> > on boot.
--
Ricardo J. Barberis
Senior SysAdmin / ITI
Dattatec.com :: Soluciones de Web Hosting
Tu Hosting hecho Simple!
------------------------------------------ |
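A way to make the rc.local mount line idempotent, and so sidestep the double-mount question entirely, is to check /proc/mounts before mounting. A sketch; the actual mfsmount invocation is left commented out because it needs a live master, so this version only prints what it would do:

```shell
# Guard against double fuse mounts: only mount if not already mounted.
is_mounted() {
    # /proc/mounts field 2 is the mountpoint
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' /proc/mounts
}

mount_if_needed() {
    mp="$1"
    if is_mounted "$mp"; then
        echo "already mounted: $mp"
    else
        echo "would mount: $mp"
        # mfsmount "$mp" -H mfsmaster   # the real mount command goes here
    fi
}

mount_if_needed /                  # the root fs is always in /proc/mounts
mkdir -p /tmp/mfs-demo-mnt
mount_if_needed /tmp/mfs-demo-mnt  # a plain directory, not a mountpoint
```

With the guard in place, 'mount -a -t fuse' in rc.local plus the fstab entry can no longer stack two fuse mounts on the same directory, regardless of the nonempty option.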
From: Steve T. <sm...@cb...> - 2012-02-28 13:45:25
|
MFS 1.6.20, CentOS 5.7 64-bit. Four chunk servers.

I have an MFS setup that was working without any issues, until yesterday when the mfsmaster hung. I shut down the master and the chunkservers. The master starts properly and everything looks good until I start one of the chunkservers (any one), at which point the master hangs.

I would appreciate a clue as to where to look, as I have not changed anything from the working configuration, and I can't find anything obviously wrong. Of course this is an emergency.

Steve |
From: René P. <ly...@lu...> - 2012-02-28 11:57:06
|
Hello!

Here's a follow-up and a warning to everyone deploying servers. It's not meant to be yet another war story; it should illustrate what to look for when deploying MooseFS master servers.

On Dec 11, 2011 at 1913 +0100, René Pfeiffer appeared and said:
> …
> We have the following scenario with a MooseFS deployment.
>
> - 3 servers
> - server #3 : 1 master
> - server #2 : chunk server
> - server #1 : chunk server, one metalogger process running on this node
>
> Server #3 suffered from a hardware RAID failure including a trashed JFS
> file system where the master logs were on. …

We have spent a couple of days diagnosing the failure with the server vendor and the ISP that did the provisioning. Apparently the RAID meltdown was due to firmware bugs, since the servers were deployed without any upgrades (classic case of communication failure). After recovery, all firmware on the servers was upgraded.

After that, the RAID failed again. This time the controller removed 2 out of 4 disks because they weren't responding. Since the 2 removed disks were a complete RAID1 container, the server went out of service again (this time not as master but only as metalogger).

Analysis yielded that storage server #3 is the only one with 2 TB disks, which were _not_ approved by the hardware vendor (the ISP used third-party disks to boost storage capacity). Storage servers #1 and #2 run 1 TB disks approved by the hardware vendor. Apparently the firmware of the RAID controller does not like the 2 TB disks and their firmware, leading to timeouts and communication errors on the data bus. Server types and disk models are available (please ask me off-list) if anyone is interested.

We haven't figured out why the metalogger data was not useful after the first failure, but we suspect that due to the massive data corruption on storage server #3, the data sent to the metalogger was corrupt as well. I don't know if the master sends data from disk or from memory to the metalogger(s). If it reads data from disk and sends it, then our RAID controller might have eaten the data already.

Best regards,
René Pfeiffer.
--
)\._.,--....,'``. fL Let GNU/Linux work for you while you take a nap.
/, _.. \ _\ (`._ ,. R. Pfeiffer <lynx at luchs.at> + http://web.luchs.at/
`._.-(,_..'--(,_..'`-.;.' - System administration + Consulting + Teaching -
Got mail delivery problems? http://web.luchs.at/information/blockedmail.php |
From: Chris P. <ch...@ec...> - 2012-02-28 08:09:54
|
Hi All

I have noticed that if I restart a chunkserver, when it rejoins, the CGI shows that some of the chunks are undergoal (about 100 or so, depending on how long it was offline for). I assume this is because chunks are changing while the chunkserver is offline, and it has outdated copies.

Most of the undergoal chunks are re-replicated fairly quickly (a minute or two), but I often see a few chunks that take a longer time to get replicated (up to an hour or more). I can see that this often happens to the same chunks (same ID). In my case, this chunk came up undergoal a lot while I was restarting my chunkservers:

ndb-test1-02.os.img chunk 224: 0000000000001068_00000036 / (id:4200 ver:54) copy 1: 10.168.8.54:9422

I had also been seeing the following in my logs:

replicator: got status: 19 from (XXXXX)

19 is "wrong chunk version". I am assuming that the replicator is trying to replicate that chunk, but as it is changing so often, by the time the replicator has copied the data, the copy is invalid, so it is not used.

Can someone confirm my thoughts above? Would it be useful to have a patch to force replication of a chunk after X number of failed attempts (by locking the source chunk for a short while, to ensure that replication happens)?

Regards
Chris |
From: <nl...@si...> - 2012-02-28 05:08:19
|
My messages log has some errors. Please tell me why? Thank you

Feb 28 13:05:22 hadoop-grid3-node33 mfsmount[5190]: file: 8, index: 0 - fs_writechunk returns status 11
Feb 28 13:05:22 hadoop-grid3-node33 mfsmount[5190]: file: 5, index: 0 - fs_writechunk returns status 11
Feb 28 13:05:22 hadoop-grid3-node33 mfsmount[5190]: file: 8, index: 0 - fs_writechunk returns status 11
Feb 28 13:05:23 hadoop-grid3-node33 mfsmount[5190]: file: 5, index: 0 - fs_writechunk returns status 11 |