From: W K. <wk...@bn...> - 2011-05-23 22:32:55
|
Last night, we had one of our 4 chunkservers 'lock up' in some mysterious way. The master started giving off these messages:

May 22 22:37:50 mfs1master mfsmaster[2522]: (192.168.0.24:9422) chunk: 000000000023DB7F replication status: 22
May 22 22:37:51 mfs1master mfsmaster[2522]: (192.168.0.24:9422) chunk: 0000000000149C9D replication status: 22
May 22 22:37:51 mfs1master mfsmaster[2522]: (192.168.0.24:9422) chunk: 00000000002EB8F6 deletion status: 22
May 22 22:38:26 mfs1master mfsmaster[2522]: connection with ML(192.168.0.24) has been closed by peer
May 23 11:12:11 mfs1master mfsmaster[2522]: chunkserver disconnected - ip: 192.168.0.24, port: 9422, usedspace: 0 (0.00 GiB), totalspace: 0 (0.00 GiB)

MooseFS did the right thing and kicked the chunkserver out. There was no interruption of service, and we didn't even notice the problem until someone looked at the CGI this morning and saw that we had a large number of undergoal (goal=2) files, which MooseFS was fixing (and had been fixing all night) at a rate of about 2-4 chunks a second.

So we replaced the failed chunkserver and continued on, quite content with how resilient MooseFS was under a failure.

We then thought about it and decided that we had gone a long time with only 1 copy of a large number of chunks, and that perhaps a goal of 3 would have been safer (i.e. if 1 of the 4 chunkservers dies, we still have 2 copies and could lose a second chunkserver without harm). So we reset the goal from 2 to 3. We did this while we were still in an undergoal position at goal=2 for about 10,000 chunks that hadn't yet been healed.

So now the CGI is showing 10,000+ chunks with a single copy (red), 2 million+ chunks are now orange (2 copies), and the system is happily increasing the 'green' 3-valid-copy column.

The problem is that it seems to be concentrating on the orange (2 copy) files and ignoring the 10,000+ red ones that are most at risk. In the last hour we've seen a few 'red' chunks disappear, but the vast majority of activity is occurring in the orange (2 copy) column.

Shouldn't the replication worry about the single copy files first?

I also realize we could simply set the goal back to 2, let it finish that up, and THEN switch it to 3, but I'm curious as to what the community says.

-WK |
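To make the numbers in this report concrete, here is a small illustrative sketch (plain Python, not MooseFS code) of how raising the goal from 2 to 3 reclassifies the chunk population, and why the 10,000 most-at-risk chunks get buried:

```python
def undergoal_deficit(valid_copies, goal):
    """Number of copies a chunk is missing relative to its goal."""
    return max(goal - valid_copies, 0)

def census(copy_counts, goal):
    """Tally chunks by how far undergoal they are."""
    counts = {}
    for copies in copy_counts:
        d = undergoal_deficit(copies, goal)
        counts[d] = counts.get(d, 0) + 1
    return counts

# A toy population resembling the numbers above: ~10,000 chunks still
# at a single copy, ~2 million already healed to two copies.
population = [1] * 10_000 + [2] * 2_000_000

# Under goal=2, only the 10,000 single-copy chunks are undergoal.
# After raising the goal to 3, *every* chunk is undergoal, and the
# 2 million deficit-1 chunks vastly outnumber the 10,000 deficit-2
# chunks that are actually one failure away from data loss.
```

Unless the scheduler orders its work by deficit, the deficit-2 chunks are simply lost in the crowd.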
From: WK <wk...@bn...> - 2011-05-26 00:41:22
|
On 5/23/2011 3:32 PM, W Kern wrote:
> So now the CGI is showing 10,000+ chunks with a single copy (red), 2
> million+ chunks are now orange (2 copies) and the system is happily
> increasing the 'green' 3 valid copy column.
>
> The problem is that it seems to be concentrating on the orange (2 copy)
> files and ignoring the 10,000+ red ones that are most at risk. In the
> last hour we've seen a few 'red' chunks disappear but the vast majority
> of activity is occurring in the orange (2 copy) column.
>
> Shouldn't the replication worry about the single copy files first?
>
> I also realize we could simply set the goal back to 2 let it finish that
> up and THEN switch it to 3 but I'm curious as to what the community says.

Just a followup: after a day or so we still had "red" single-copy chunks. Obviously the undergoal routine doesn't look at how badly undergoal a given chunk is.

So we dropped the goal back down to 2, and MFS immediately focused on the single-copy chunks. The only problem observed was that shortly after dropping the goal back down to 2, the mount complained of connection issues and people were kicked out of their IMAP sessions. That condition returned to normal less than a minute later, and no files were lost.

Once the undergoal at 2 was completed an hour or so later, we reset the goal to 3, and in a few days we should be fully green. In the meantime, we have at least two copies and are not vulnerable to an additional failure.

I would still suggest that the "undergoal" routine might want to look first at those chunks that are more severely out of goal, then go back and fix the others, assuming that doesn't impact overall performance.

-bill |
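Bill's suggestion amounts to ordering the replication queue by copy deficit rather than treating all undergoal chunks alike. A minimal sketch of that ordering (illustrative Python only; MooseFS's actual replication loop is internal to mfsmaster):

```python
def replication_order(chunks):
    """Sort chunks so the most severely undergoal ones are healed first.

    `chunks` is an iterable of (chunk_id, valid_copies, goal) tuples;
    chunks already at or above goal need no replication and are dropped.
    """
    undergoal = [(cid, copies, goal) for cid, copies, goal in chunks
                 if copies < goal]
    # Primary key: copy deficit, descending. A chunk missing 2 copies
    # is at far greater risk than one missing a single copy.
    return sorted(undergoal, key=lambda c: c[2] - c[1], reverse=True)

queue = replication_order([
    ("A", 2, 3),   # one copy short
    ("B", 1, 3),   # two copies short -- most at risk
    ("C", 3, 3),   # at goal, nothing to do
])
# queue[0] is chunk "B": the single-copy chunk is healed first.
```

With this ordering, dropping the goal back to 2 and raising it again would be unnecessary; the scheduler would reach the single-copy chunks first on its own.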
From: Michal B. <mic...@ge...> - 2011-06-08 09:10:24
|
Hi!

Thanks for pointing this out. Yes, it would be really important to have some priorities in the rebalancing process, which we have put on our to-do list.

We have this plan for prioritizing:
- high priority (chunks with one copy with goal>1; orange color in CGI monitor)
- middle priority (chunks undergoal and overgoal; yellow and blue color)
- low priority (normal and deleted chunks; green and grey color)
- "special" priority (chunks with 0 copies; red color)

What is your opinion about it?

Kind regards
Michał Borychowski
MooseFS Support Manager
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Gemius S.A.
ul. Wołoska 7, 02-672 Warszawa
Budynek MARS, klatka D
Tel.: +4822 874-41-00
Fax : +4822 874-41-01 |
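The proposed four tiers can be expressed as a simple classification function (an illustrative Python sketch of the plan as stated in this thread, not actual MooseFS code):

```python
def priority_tier(valid_copies, goal):
    """Map a chunk's state to the proposed replication priority tiers.

    Colors refer to the CGI monitor as described in the proposal.
    """
    if valid_copies == 0:
        # "special" (red): no valid copy left to replicate from, so
        # ordinary replication cannot fix it -- handled separately.
        return "special"
    if valid_copies == 1 and goal > 1:
        # high (orange): one failure away from data loss.
        return "high"
    if valid_copies != goal:
        # middle (yellow/blue): undergoal or overgoal, but not critical.
        return "middle"
    # low (green, plus grey for deleted chunks): nothing urgent.
    return "low"
```

A scheduler would then simply drain the "high" bucket before touching "middle", which directly addresses the original complaint in this thread.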
From: jose m. <let...@us...> - 2011-06-08 16:22:05
|
On Wed, 2011-06-08 at 11:10 +0200, Michal Borychowski wrote:
> Hi!
>
> Thanks for pointing this out. Yes, it would be really important to make some
> priorities in the rebalancing process which we put on our to-do list.
>
> We have this plan for prioritizing:
> - high priority (chunks with one copy with goal>1; orange color in CGI monitor)
> - middle priority (chunks undergoal and overgoal; yellow and blue color)
> - low priority (normal and deleted chunks; green and grey color)
> - "special" priority (chunks with 0 copies; red color)
>
> What is your opinion about it?

* Excellent, thank you. |
From: <wk...@bn...> - 2011-06-09 06:14:58
|
On 6/8/11 2:10 AM, Michal Borychowski wrote:
> Hi!
>
> Thanks for pointing this out. Yes, it would be really important to make some
> priorities in the rebalancing process which we put on our to-do list.
>
> We have this plan for prioritizing:
> - high priority (chunks with one copy with goal>1; orange color in CGI
> monitor)

Yes, the quicker we get back to 'at goal', the better. I've seen servers from the same purchase lot go down within days of each other, so the more 'insurance' the better.

> - middle priority (chunks undergoal and overgoal; yellow and blue color)
> - low priority (normal and deleted chunks; green and grey color)
> - "special" priority (chunks with 0 copies; red color)
>
> What is your opinion about it?

Looks fine, though overgoal doesn't bother me that much <grin>.

-bill |
From: rxknhe <rx...@gm...> - 2011-06-08 17:02:06
|
Hi there,

If it can be given an option as an 'on the fly' tunable, that would be great. Here is the reason for asking:

(1) We like the conservative re-balancing approach of MooseFS, because in case of hardware failure or new chunkserver addition, MooseFS won't go crazy and start pounding disks and consuming I/O resources, thus causing I/O starvation for applications using the MooseFS share. This works nicely most of the time.

(2) However, if we replace a dead chunkserver or mark a disk area with '*' (i.e. a disk area to be replaced), then it takes several days to re-balance. With goal=2, this is a critical moment (when one chunkserver dies), because if we lose one more chunkserver (or disk area), we are on the verge of losing data. In this case, it would help if MooseFS could be told to speed up re-balancing and use more resources as needed. Then we could live with goal=2 (i.e. more usable space for data, instead of setting goal=3) comfortably.

Depending upon the nature of the applications served via MooseFS, the admin can decide whether to accept the risk of a longer re-balance cycle in exchange for better application I/O, or to choose a re-balance speedup and sacrifice application I/O for a while.

Although to begin with, your approach of speeding up based on "high priority (chunks with one copy with goal>1; orange color in CGI monitor)" and such is a very welcome enhancement.

regards
rxknhe

2011/6/8 Michal Borychowski <mic...@ge...>
> We have this plan for prioritizing:
> - high priority (chunks with one copy with goal>1; orange color in CGI
> monitor)
> - middle priority (chunks undergoal and overgoal; yellow and blue color)
> - low priority (normal and deleted chunks; green and grey color)
> - "special" priority (chunks with 0 copies; red color)
>
> What is your opinion about it? |