From: Wang J. <jia...@re...> - 2012-04-04 05:25:08
For disasters such as earthquakes, fires, and floods, off-site backup is a must-have, and no RAID-level solution will help. As Atom Powers said, MooseFS should provide an off-site backup mechanism. Months ago, my colleague Ken Shao sent in some patches to provide a "class"-based goal mechanism, which enables us to define different "classes" to differentiate physical locations and back up data in other physical locations (i.e., 500 km - 1000 km away). The design principles are:

1. We can afford to lose some data between the backup point and the disaster point. In this case, old data or old versions of data are intact; new data or new versions of data are lost.

2. Because cluster-to-cluster backup has many drawbacks (performance, consistency, etc.), the duplication from one location to another should take place within a single cluster.

3. Location-to-location duplication should not happen at write time, or performance/latency is hurt badly. So the goal recovery mechanism can be and should be used (CS-to-CS duplication). And to improve bandwidth efficiency and avoid peak load times, duplication can be controlled in a timely manner, and a dirty/delta algorithm should be used.

4. Metadata should be logged to the backup site. When disaster happens, the backup site can be promoted to master site.

The current rack awareness implementation is not quite what we are looking for. Seriously speaking, as 10 Gb Ethernet connections get cheaper and cheaper, traditional rack awareness is rendered useless.

On 2012/4/4 7:13, Allen, Benjamin S wrote:
> Quenten,
>
> I'm using MFS with ZFS. [...]
From: mARK b. <mb...@gm...> - 2012-04-03 23:35:21
Thanks, Dr. Chudobiak. I hadn't realized that man page existed. It does tell me what I need to know, which, unfortunately, as in Atom's case, is not what I was hoping for.

As someone mentioned in the backup strategies thread, it would be great for backup if files were distributed across all racks. That is, indeed, exactly the use I had in mind.

> Date: Tue, 03 Apr 2012 15:45:47 -0400
> From: "Dr. Michael J. Chudobiak" <mj...@av...>
> Subject: Re: [Moosefs-users] moosefs-users Digest, Vol 27, Issue 22
>
> man mfstopology.cfg

> Date: Tue, 03 Apr 2012 12:42:01 -0700
> From: Atom Powers <ap...@di...>
> Subject: Re: [Moosefs-users] moosefs-users Digest, Vol 27, Issue 22
>
> From what I remember, "rack awareness" directs reads to the servers in
> the same rack/group as the client but still distributes writes/chunks
> "randomly". [...]

--
mARK bLOORE <mb...@gm...>
From: Allen, B. S <bs...@la...> - 2012-04-03 23:13:15
Quenten,

I'm using MFS with ZFS. I use ZFS for RAIDZ2 (RAID6) and hot sparing on each chunkserver. I then only set a goal of 2 in MFS. I also have a "scratch" directory within MFS that is set to goal 1 and not backed up to tape. I attempt to get my users to organize their data between their data directory and scratch, to minimize goal overhead for data that doesn't require it.

Overhead of my particular ZFS setup is ~15% lost to parity and hot spares, although I was a bit bold with my RAIDZ2 configuration, which will make rebuild times quite long as the trade-off for lower overhead. This was done with the knowledge that RAIDZ2 can withstand two drive failures, and MFS would have another copy of the data on another chunkserver. I have not, however, tested how well MFS handles a ZFS pool degraded with data loss. I'm guessing I would take the chunkserver daemon offline, get the ZFS pool into a rebuilding state, and restart the CS. I'm guessing the CS will see missing chunks, mark them undergoal, and re-replicate them.

A more cautious RAID set would be closer to 30% overhead. Then of course with goal 2 you lose another 50%.

A side benefit of using ZFS is on-the-fly compression and de-dup of your chunkserver, an L2ARC SSD read cache (although it turns out most of my cache hits are from L1ARC, i.e. memory), and, to speed up writes, you can add a pair of ZIL SSDs.

For disaster recovery you always need to be extra careful when relying on a single system to do your live and DR sites. In this case you're asking MFS to push data to another site. You'd then be relying on a single piece of software that could equally corrupt your live site and your DR site.

Ben

On Apr 3, 2012, at 3:36 PM, Quenten Grasso wrote:
> Hi All, [...]
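A minimal sketch of the per-chunkserver layout Ben describes: one RAIDZ2 pool with a hot spare, the ZFS compression/dedup he mentions, and a mount point for the chunkserver. The pool and device names are illustrative assumptions, not taken from his setup.

# Illustrative only: one RAIDZ2 vdev plus a hot spare per chunkserver.
zpool create chunkpool raidz2 sda sdb sdc sdd sde sdf spare sdg

# The ZFS side benefits mentioned above: on-the-fly compression and dedup.
zfs set compression=on chunkpool
zfs set dedup=on chunkpool

# Mount where the MooseFS chunkserver expects its data directory,
# then list that path in mfshdd.cfg.
zfs set mountpoint=/mnt/mfschunks1 chunkpool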
From: Atom P. <ap...@di...> - 2012-04-03 21:49:36
On 04/03/2012 02:36 PM, Quenten Grasso wrote:
> I was planning to create hourly snapshots (as I understand the way
> they are implemented they don't affect performance, unlike a VMware
> snapshot, please correct me if I'm wrong) and copy these offsite to
> another mfs/cluster using rsync w/ snapshots on the other site, with
> maybe a goal of 2 at most, and using a goal of 3 on site.

Although a snapshot doesn't increase the amount of storage used by the system, it effectively doubles the amount of metadata. For even medium-sized systems, making a snapshot of the complete system may actually decrease the security of the system by introducing problems with the amount of RAM and disk used by the metamaster. On my system, with about 7 million files and 3 GB of metadata, keeping a daily snapshot for a week requires some 22 GB+ of additional RAM in the metamaster and metalogger.

In other words, just because you /can/ do snapshots doesn't mean you can do them without careful capacity planning. (And based on the number of people having issues with their metamaster, I am very hesitant to recommend that strategy.)

--
--
Perfection is just a word I use occasionally with mustard.
--Atom Powers--
Director of IT
DigiPen Institute of Technology
+1 (425) 895-4443
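A back-of-the-envelope check of the figures Atom gives (7 million files, roughly 3 GB of metadata). The only assumption is that each retained snapshot adds about one full copy of the metadata, which is what "effectively doubles the amount of metadata" implies:

# Atom's figures: ~7 million files, ~3 GB of metadata resident in master RAM.
# Each retained snapshot adds roughly another full copy of that metadata.
metadata_gb=3
snapshots=7          # one per day, retained for a week
echo "$((metadata_gb * snapshots)) GB of extra RAM"   # about 21 GB, in line with the 22 GB+ reported above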
From: Quenten G. <QG...@on...> - 2012-04-03 21:36:55
Hi All,

How large are your metadata & logs at this stage? Just trying to mitigate this exact issue myself.

I was planning to create hourly snapshots (as I understand the way they are implemented, they don't affect performance, unlike a VMware snapshot; please correct me if I'm wrong) and copy these offsite to another mfs cluster using rsync, with snapshots on the other site at a goal of 2 at most, and using a goal of 3 on site.

I guess the big issue here is storing our data 5 times in total vs. tapes. However, I guess it would be "quicker" to recover from a "failure" having a running cluster on site B vs. a tape backup, and, dare I say it, (possibly) more reliable than a single tape and tape library.

Also, I've been tossing up the idea of using ZFS for storage. The reason I say this is that I know MFS has built-in checksumming, like ZFS, and all that good stuff. However, having to store our data 3 times + 2 times is expensive; maybe storing it 2+1 instead would work out at scale, by using the likes of ZFS for reliability and then using MFS purely for availability, instead of for reliability & availability as well...

It would be great if there was a way to use some kind of rack awareness to say "at all times keep a goal of 1 or 2 of the data offsite on our 2nd MFS cluster". When I was speaking to one of the staff of the MFS support team, they mentioned this was kind of being developed for another customer, so we may see some kind of solution?

Quenten

-----Original Message-----
From: Allen, Benjamin S [mailto:bs...@la...]
Sent: Wednesday, 4 April 2012 7:17 AM
To: moo...@li...
Subject: Re: [Moosefs-users] Backup strategies

Similar plan here. [...]
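A rough sketch of the snapshot-then-rsync idea Quenten outlines. The mount points, remote host name, and bandwidth limit are assumptions for illustration:

# Take an hourly MooseFS snapshot of the data tree (mfsmakesnapshot makes a
# lazy in-cluster copy), then push it to the second site's mount with rsync.
STAMP=$(date +%Y%m%d%H)
mfsmakesnapshot /mnt/mfs/data /mnt/mfs/snapshots/data-$STAMP

rsync -a --delete --bwlimit=20000 \
    /mnt/mfs/snapshots/data-$STAMP/ \
    backup-site:/mnt/mfs-dr/data/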
From: Allen, B. S <bs...@la...> - 2012-04-03 21:16:57
Similar plan here.

I have a dedicated server for MFS backup purposes. We're using IBM's Tivoli to push to a large GPFS archive system backed with a SpectraLogic tape library. I have the standard Linux Tivoli client running on this host. One key with Tivoli is to use the DiskCacheMethod, and to set the disk cache to be somewhere on local disk instead of the root of the mfs mount.

I also back up mfsmaster's files every hour and retain at least a week of these backups. The various horror stories we've heard on this mailing list have all been about corrupt metadata files from mfsmaster. It's a really good idea to limit your exposure to this.

For good measure I also back up the metalogger's files every night.

One dream for backup of MFS is to somehow utilize the metadata files dumped by mfsmaster or metalogger to do a metadata "diff". The goal of this process would be to produce a list of all objects in the filesystem that have changed between two metadata.mfs.back files. Thus you could feed your backup client a list of files, without the client having to inspect the filesystem itself. This idea is inspired by ZFS's diff functionality, where ZFS can show the changes between a snapshot and the live filesystem.

Ben

On Apr 3, 2012, at 2:18 PM, Atom Powers wrote:
> [...]
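A minimal sketch of the hourly mfsmaster backup with one-week retention that Ben describes. The destination path and exact schedule are assumptions; the source directory matches the /var/lib/mfs location that appears later in this archive.

#!/bin/sh
# /etc/cron.hourly/mfsmaster-backup (illustrative paths)
SRC=/var/lib/mfs
DST=/backup/mfsmaster/$(date +%Y%m%d-%H)

mkdir -p "$DST"
cp -p "$SRC"/metadata.mfs.back "$SRC"/changelog.*.mfs "$DST"/ 2>/dev/null

# keep roughly one week of hourly copies
find /backup/mfsmaster -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +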
From: Atom P. <ap...@di...> - 2012-04-03 20:19:03
I've been thinking about this for a while and I think Occam's razor (the simplest idea is the best) might provide some guidance.

MooseFS is fault-tolerant, so you can mitigate "hardware failure".
MooseFS provides a trash space, so you can mitigate "accidental deletion" events.
MooseFS provides snapshots, so you can mitigate "corruption" events.

The remaining scenario, "somebody stashes a nuclear warhead in the locker room", requires off-site backup. If "rack awareness" were able to guarantee chunks in multiple locations, then that would mitigate this event. Since it can't, I'm going to be sending data off-site using a large LTO5 tape library managed by Bacula, on a server that also runs an mfsmount of the entire system.

On 04/03/2012 12:56 PM, Steve Thompson wrote:
> [...]

--
--
Perfection is just a word I use occasionally with mustard.
--Atom Powers--
Director of IT
DigiPen Institute of Technology
+1 (425) 895-4443
From: Steve T. <sm...@cb...> - 2012-04-03 19:56:09
OK, so now you have a nice and shiny and absolutely massive MooseFS file system. How do you back it up?

I am using Bacula and divide the MFS file system into separate areas (eg directories beginning with a, those beginning with b, and so on) and use several different chunkservers to run the backup jobs, on the theory that at least some of the data is local to the backup process. But this still leaves the vast majority of data to travel the network twice (a planned dedicated storage network has not yet been implemented). This results in pretty bad backup performance and high network load. Any clever ideas?

Steve

--
----------------------------------------------------------------------------
Steve Thompson, Cornell School of Chemical and Biomolecular Engineering
smt AT cbe DOT cornell DOT edu
"186,282 miles per second: it's not just a good idea, it's the law"
----------------------------------------------------------------------------
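One hedged way to picture the per-area split Steve describes is a Bacula FileSet per directory prefix, with each area's Job pointed at a different chunkserver client that also mounts MFS. All names, paths, schedules and storage resources below are invented for the example, not taken from his configuration:

# bacula-dir.conf fragment (illustrative sketch only)
FileSet {
  Name = "mfs-area-a"
  Include {
    Options {
      signature = MD5
    }
    File = /mnt/mfs/a
  }
}

Job {
  Name = "backup-mfs-area-a"
  Type = Backup
  Level = Incremental
  Client = chunkserver1-fd     # run this area's job on a chunkserver that mounts MFS
  FileSet = "mfs-area-a"
  Schedule = "WeeklyCycle"
  Storage = LTO5-library
  Pool = Default
  Messages = Standard
}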
From: Dr. M. J. C. <mj...@av...> - 2012-04-03 19:45:54
On 04/03/2012 03:29 PM, mARK bLOORE wrote:
> Thanks, Anh, but I was asking what rack awareness IS, not whether it
> is present. I tried it out, and it didn't do what I was hoping it
> would. I don't know what it does, and I can find no documentation on
> it.

man mfstopology.cfg
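For readers who, like mARK, haven't seen that man page: mfstopology.cfg maps client/chunkserver address ranges to switch (rack) numbers, which, per Atom's description elsewhere in this thread, the master uses to prefer same-rack reads. The sketch below is illustrative only; the addresses and rack IDs are invented, and the authoritative syntax and file location are in the man page.

# mfstopology.cfg sketch (see "man mfstopology.cfg" for the real syntax)
# address-range        switch/rack number
192.168.1.0/24         1
192.168.2.0/24         2
10.1.0.0/16            3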
From: Atom P. <ap...@di...> - 2012-04-03 19:42:13
I don't remember where I found the documentation, but I know I did. (A man page, maybe...)

From what I remember, "rack awareness" directs reads to the servers in the same rack/group as the client but still distributes writes/chunks "randomly".

I was /hoping/ that it would distribute chunks according to rack/group, and maybe become a mechanism to distribute data to multiple sites, but I too was disappointed. Still, it is a valuable feature and maybe it will be extended sometime later.

On 04/03/2012 12:29 PM, mARK bLOORE wrote:
> Thanks, Anh, but I was asking what rack awareness IS, not whether it
> is present. I tried it out, and it didn't do what I was hoping it
> would. I don't know what it does, and I can find no documentation on
> it.

--
--
Perfection is just a word I use occasionally with mustard.
--Atom Powers--
Director of IT
DigiPen Institute of Technology
+1 (425) 895-4443
From: mARK b. <mb...@gm...> - 2012-04-03 19:30:19
Thanks, Anh, but I was asking what rack awareness IS, not whether it is present. I tried it out, and it didn't do what I was hoping it would. I don't know what it does, and I can find no documentation on it.

> Date: Sat, 24 Mar 2012 13:58:53 +0700
> From: "Anh K. Huynh" <anh...@gm...>
> Subject: Re: [Moosefs-users] what is rack awareness?
> To: moo...@li...
> Message-ID: <201...@gm...>
> Content-Type: text/plain; charset=US-ASCII
>
> On Fri, 23 Mar 2012 19:36:18 -0400
> mARK bLOORE <mb...@gm...> wrote:
>
>> is there a description of what "rack awareness" means in mfs? i did a
>> simple test of the assumption (and hope) that files would be
>> distributed to have at least one copy in each server group, but that
>> did not happen. in fact the files i created started out with copies
>> randomly distributed among chunk servers, but ended up all in one
>> server group.
>
> In the latest version 2.6.24, there is a support. You can check in
> "topology" configuration (distributed in the default installation of
> MFS 1.6.24). I've never tried but I think that's what you're looking
> for. There isn't similar support in any previous version.
>
> Regards,
>
> --
> "It is better for civilization to be going down the drain than to be
> coming up it."
> -- Henry Allen

--
mARK bLOORE <mb...@gm...>
From: Travis H. <tra...@tr...> - 2012-04-02 15:39:10
The MooseFS volume does not have a hard maximum size restriction. Generally, the MFS master process requires more RAM in proportion to the number of files stored within the MooseFS volume (approximately 300 MB per million files); see the Documentation page for more details on sizing.

There was a 2 TB maximum (individual) file size limitation (see the FAQ). This has been removed as of the 2.6.24 release, now increased to 128 PB (see the release notes).

On 12-04-01 11:28 AM, lx wrote:
> Hello!
> Moosefs client is a single mount point to support the maximum capacity?
> For example three chunkserver 3 × 2TB, the client is a single mount
> point is close to 6TB?
> Thank you for your answers.
> liuxiang
> 2012/4/1
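A quick worked example of that sizing rule of thumb (the 20-million-file count is hypothetical):

# Roughly 300 MB of master RAM per million files stored in MooseFS.
files_millions=20
echo "$((files_millions * 300)) MB"   # 6000 MB, i.e. about 6 GB of RAM on the master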
From: lx <sol...@16...> - 2012-04-01 15:28:19
Hello!

Moosefs client is a single mount point to support the maximum capacity? For example three chunkserver 3 × 2TB, the client is a single mount point is close to 6TB?

Thank you for your answers.

liuxiang
2012/4/1
From: Sébastien M. <seb...@gm...> - 2012-04-01 01:15:55
Hi,

I lost my data this morning. I've been using MooseFS for over 10 months and never had such a problem. I have two servers (Debian stable); one is the mfsmaster (file03) and the other one a mfschunkserver. Both have 2 GB of memory and chunkserver storage of about 4 TB.

I got the following messages in syslog:

Mar 31 07:09:00 file03 mfsmaster[3182]: total: usedspace: 7111203336192 (6622.82 GiB), totalspace: 9094555459584 (8469.96 GiB), usage: 78.19%
Mar 31 07:10:00 file03 mfsmaster[3182]: chunkservers status:
Mar 31 07:10:00 file03 mfsmaster[3182]: server 1 (ip: 192.168.0.182, port: 9422): usedspace: 3545542049792 (3302.04 GiB), totalspace: 3616382492672 (3368.02 GiB), usage: 98.04%
Mar 31 07:10:00 file03 mfsmaster[3182]: server 2 (ip: 192.168.0.181, port: 9422): usedspace: 3565661286400 (3320.78 GiB), totalspace: 5478172966912 (5101.95 GiB), usage: 65.09%
Mar 31 07:10:00 file03 mfsmaster[3182]: total: usedspace: 7111203336192 (6622.82 GiB), totalspace: 9094555459584 (8469.96 GiB), usage: 78.19%
Mar 31 07:10:00 file03 mfsmaster[3182]: connection with CS(192.168.0.190) has been closed by peer
Mar 31 07:10:00 file03 mfsmaster[3182]: chunkserver disconnected - ip: 192.168.0.190, port: 0, usedspace: 0 (0.00 GiB), totalspace: 0 (0.00 GiB)
Mar 31 07:10:58 file03 kernel: [849836.227866] mfsmaster: page allocation failure. order:5, mode:0x4020
Mar 31 07:10:58 file03 kernel: [849836.227872] Pid: 3182, comm: mfsmaster Not tainted 2.6.32-5-686 #1
Mar 31 07:10:59 file03 kernel: [849837.670014] mfsmaster: page allocation failure. order:5, mode:0x4020
Mar 31 07:10:59 file03 kernel: [849837.670021] Pid: 3182, comm: mfsmaster Not tainted 2.6.32-5-686 #1
Mar 31 07:11:00 file03 mfsmaster[3182]: chunkservers status:
Mar 31 07:11:00 file03 mfsmaster[3182]: server 1 (ip: 192.168.0.182, port: 9422): usedspace: 3545542049792 (3302.04 GiB), totalspace: 3616382492672 (3368.02 GiB), usage: 98.04%
Mar 31 07:11:00 file03 mfsmaster[3182]: server 2 (ip: 192.168.0.181, port: 9422): usedspace: 3565661286400 (3320.78 GiB), totalspace: 5478172966912 (5101.95 GiB), usage: 65.09%
Mar 31 07:11:00 file03 mfsmaster[3182]: total: usedspace: 7111203336192 (6622.82 GiB), totalspace: 9094555459584 (8469.96 GiB), usage: 78.19%
Mar 31 07:11:05 file03 kernel: [849843.214701] mfsmaster: page allocation failure. order:5, mode:0x4020
Mar 31 07:11:05 file03 kernel: [849843.214707] Pid: 3182, comm: mfsmaster Not tainted 2.6.32-5-686 #1
Mar 31 07:11:13 file03 kernel: [849851.464014] mfsmaster: page allocation failure. order:5, mode:0x4020
Mar 31 07:11:13 file03 kernel: [849851.464021] Pid: 3182, comm: mfsmaster Not tainted 2.6.32-5-686 #1
Mar 31 07:11:24 file03 kernel: [849862.732083] mfsmaster: page allocation failure. order:5, mode:0x4020
Mar 31 07:11:24 file03 kernel: [849862.732088] Pid: 3182, comm: mfsmaster Not tainted 2.6.32-5-686 #1
Mar 31 07:11:25 file03 kernel: [849863.626723] mfsmaster: page allocation failure. order:5, mode:0x4020
Mar 31 07:11:25 file03 kernel: [849863.626729] Pid: 3182, comm: mfsmaster Not tainted 2.6.32-5-686 #1
Mar 31 07:11:27 file03 kernel: [849865.858301] mfsmaster: page allocation failure. order:5, mode:0x4020
Mar 31 07:11:27 file03 kernel: [849865.858307] Pid: 3182, comm: mfsmaster Not tainted 2.6.32-5-686 #1
Mar 31 07:11:31 file03 kernel: [849869.633258] mfsmaster: page allocation failure. order:5, mode:0x4020
Mar 31 07:11:31 file03 kernel: [849869.633264] Pid: 3182, comm: mfsmaster Not tainted 2.6.32-5-686 #1

I'm using kernel 2.6.32-5-686.

The metadata.mfs.back was totally corrupted (its size was about 100 MB when it was 400 MB a few days ago, and I didn't remove that many files), and both "mfsmetarestore -a" and "mfsmetarestore -m metadata.mfs.back changelog.*.mfs" end in a segmentation fault. Same problem on the mfsmetalogger, since it copies the corrupted metadata.mfs.back file. How can this happen?

I found an old copy of the metadata (10 days old) and I started from that, losing 10 days of production.

Now I'm worried: the server is up, but metadata.mfs.back hasn't been written since this morning. It has exactly the same md5 signature as this morning:

root@file03:/var/lib/mfs# ls -l metadata.mfs.back.tmp metadata.mfs.back
-rw-r----- 1 root root 412946488 mars 31 10:55 metadata.mfs.back
-rw-r--r-- 1 root root 412946488 mars 21 10:27 metadata.mfs.back.tmp
root@file03:/var/lib/mfs# md5sum metadata.mfs.back.tmp metadata.mfs.back
d1e7c51a5f8a752dc18aa645552165e7 metadata.mfs.back.tmp
d1e7c51a5f8a752dc18aa645552165e7 metadata.mfs.back

So it looks like my metadata is not being dumped... Please help.

Regards,
Sébastien
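Given the symptom at the end (metadata.mfs.back no longer being rewritten), a simple hedged check like the one below can alert you when the hourly dump goes stale. The path and the 70-minute threshold are assumptions:

#!/bin/sh
# Warn if the master's metadata dump is older than ~70 minutes
# (mfsmaster normally rewrites metadata.mfs.back every hour).
META=/var/lib/mfs/metadata.mfs.back
if [ -z "$(find "$META" -mmin -70 2>/dev/null)" ]; then
    echo "WARNING: $META has not been updated in the last 70 minutes" | \
        mail -s "mfsmaster metadata dump stale" root
fi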
From: Corin L. <in...@co...> - 2012-03-31 18:06:53
Hi,

has anybody tried using the new ploop storage for OpenVZ images (http://wiki.openvz.org/Ploop) together with MooseFS?

ploop mount -d /dev/ploop0 /mfs-mount/root.hdd
Adding delta dev=/dev/ploop0 img=/mfs-mount/root.hdd (rw)
PLOOP_IOC_ADD_DELTA /mfs-mount/root.hdd: Invalid argument

In syslog I find:

kernel: File on FS without backing device

I thought it might be because ploop needs direct I/O, but mounting with direct I/O enabled seems not to be supported by MooseFS:

mfsmount -o direct_io /mfs-mount
mfsmaster accepted connection with parameters: read-write,restricted_ip ; root mapped to root:root
fuse: unknown option `direct_io'

Has anybody got it working somehow? :)

Btw: I'll also post this question to the openvz mailing list as I'm pretty sure the devs there might have interesting answers too.

Corin
From: Wenhua Z. <shi...@gm...> - 2012-03-31 02:28:42
Hi all,

When we ran a write test against MFS, I stopped all the chunkservers. After restarting the chunkservers, I found "damaged" errors in the CGI page and many "invalid copies" errors in the master's log.

We have 4 chunkservers and 5 other servers. All 9 servers have mfsmount running and write data to the mounted folder (I had changed the goal from 1 to 3 about 2 hours earlier). Before I stopped the chunkservers, the inflow rate of the MFS was about 40 MBytes/sec.

1. Errors in logs

"chunk invalid copies" errors such as the ones below from the mfsmaster log:

Mar 29 17:07:52 XXX-22 mfsmaster[7192]: chunk 00000000000EC3A8 has only invalid copies (2) - please repair it manually
Mar 29 17:07:52 XXX-22 mfsmaster[7192]: chunk 00000000000EC3A8_00000002 - invalid copy on (10.7.17.54 - ver:00000001)
Mar 29 17:07:52 XXX-22 mfsmaster[7192]: chunk 00000000000EC3A8_00000002 - invalid copy on (10.7.17.86 - ver:00000000)
......
Mar 29 17:07:54 XXX-22 mfsmaster[7192]: chunk 00000000000EC3BF has only invalid copies (1) - please repair it manually
Mar 29 17:07:54 XXX-22 mfsmaster[7192]: chunk 00000000000EC3BF_00000003 - invalid copy on (10.7.17.54 - ver:00000001)
......

Besides that, we also found some other errors in the chunkservers' logs:

Mar 29 17:07:21 XXX-54 mfsmount[1883]: file: 170882, index: 31 - fs_writechunk returns status 8
...
Mar 29 17:07:43 XXX-85 mfschunkserver[26178]: write_block_to_chunk: file:/data2/mfsdata/84/chunk_00000000000EC384_00000002.mfs - crc error
...
Mar 29 17:07:43 XXX-85 mfsmount[6604]: writeworker: write error: 29
......
Mar 29 17:07:44 XXX-85 mfsmount[6604]: writeworker: write error: 13
......
Mar 29 17:07:44 XXX-85 mfsmount[6604]: writeworker: write error: 28

The error numbers in the code:

#define ERROR_CHUNKLOST 8 // Chunk lost
#define ERROR_NOCHUNK 13 // No such chunk
#define ERROR_DISCONNECTED 28 // Disconnected
#define ERROR_CRC 29 // CRC error

We got some more information about the chunks from the chunkservers and found that the error chunks have more than one copy, but their versions are not the same. E.g., chunk 00000000000EC3A8:

The mfsmaster log:

Mar 29 22:42:01 XXX-22 mfsmaster[7192]: chunk 00000000000EC3A8 has only invalid copies (2) - please repair it manually
Mar 29 22:42:01 XXX-22 mfsmaster[7192]: chunk 00000000000EC3A8_00000003 - invalid copy on (10.7.17.86 - ver:00000002)
Mar 29 22:42:01 XXX-22 mfsmaster[7192]: chunk 00000000000EC3A8_00000003 - invalid copy on (10.7.17.54 - ver:00000001)

The chunkservers' logs:

Mar 29 17:07:43 XXX-85 mfschunkserver[26178]: write_block_to_chunk: file:/data8/mfsdata/A8/chunk_00000000000EC3A8_00000002.mfs - crc error
Mar 29 17:31:24 XXX-85 mfschunkserver[15680]: write_block_to_chunk: file:/data8/mfsdata/A8/chunk_00000000000EC3A8_00000003.mfs - crc error
Mar 29 17:07:43 XXX-86 mfschunkserver[8547]: write_block_to_chunk: file:/data9/mfsdata/A8/chunk_00000000000EC3A8_00000002.mfs - crc error

The files on the chunkservers (54, 85 and 86):

54: 41096192 Mar 29 17:07 chunk_00000000000EC3A8_00000001.mfs
85: 41096192 Mar 29 17:31 chunk_00000000000EC3A8_00000003.mfs
86: 41096192 Mar 29 17:07 chunk_00000000000EC3A8_00000002.mfs

md5 values of the files:

7bd65382eb63db86d5b68395ae546f40 /data3/mfsdata/A8/chunk_00000000000EC3A8_00000001.mfs
aa8f3bab55dfbf3f7a2dbd42993e4e51 /data8/mfsdata/A8/chunk_00000000000EC3A8_00000003.mfs
9101e3feb0ecaea386afe0500df56941 /data9/mfsdata/A8/chunk_00000000000EC3A8_00000002.mfs

In fact, this chunk is part of the file "/mnt/mfs/test/p/20120329/0000000c/0000027e":

/mnt/mfs/test/p/20120329/0000000c/0000027e:
chunk 0: 00000000000EC181_00000001 / (id:967041 ver:1)
copy 1: 10.7.17.54:9422
copy 2: 10.7.17.85:9422
copy 3: 10.7.17.86:9422
chunk 1: 00000000000EC1F3_00000001 / (id:967155 ver:1)
copy 1: 10.7.17.54:9422
copy 2: 10.7.17.55:9422
copy 3: 10.7.17.86:9422
......
chunk 6: 00000000000EC3A8_00000003 / (id:967592 ver:3)
no valid copies !!!

When we use the mfsfileinfo command, mfsmount sends a "MATOCU_FUSE_READ_CHUNK" message to the master. If the chunk of the file is not correct, the response from the master will not contain the information we expect to get, and "no valid copies !!!" will be printed (as for chunk 6: 00000000000EC3A8_00000003).

2. Questions

1). Till now, I think the main cause of the "invalid copy" error is the chunk-version conflict, am I right? But my doubt is about when the chunk version changes. Thanks. From the logs, we find many files whose chunk versions are not 1, but 2, 3 or even 7. E.g.:

chunk 0: 00000000000D5394_00000003 / (id:873364 ver:3)
copy 1: 10.7.17.54:9422
copy 2: 10.7.17.85:9422
copy 3: 10.7.17.86:9422
chunk 1: 00000000000D5505_00000003 / (id:873733 ver:3)
copy 1: 10.7.17.55:9422
copy 2: 10.7.17.85:9422
copy 3: 10.7.17.86:9422
chunk 2: 00000000000D55F3_00000007 / (id:873971 ver:7)
copy 1: 10.7.17.54:9422
copy 2: 10.7.17.55:9422
copy 3: 10.7.17.86:9422
......

2). What will happen to the files awaiting to be saved when a chunkserver goes down while mfsmount is already running? And when the chunkserver restarts, is there any influence on the saved files? (E.g., when all the chunkservers power off, maybe including the master.) According to "http://www.moosefs.org/moosefs-faq.html#master-online", when the master server goes down while mfsmount is already running, mfsmount doesn't disconnect the mounted resource, and files awaiting to be saved stay quite long in the queue while trying to reconnect to the master server.

3). As we know, if we want to stop one chunkserver or remove one HD of a chunkserver, we have to proceed as in "http://www.moosefs.org/moosefs-faq.html#add_remove". It takes a long time and many steps before we can remove the chunkserver or its disks; is there any better method? We think we could add an "access-level" value to the chunkserver: only when a chunkserver's access level is set to "WRITE" can we write data to it, otherwise the chunkserver is READ-ONLY. After this has been implemented, we could set the access level of a chunkserver to "READ-ONLY" when we want to stop it. But so far we are not sure whether this method will work well, and we need to do some more tests. Do you have any ideas about this, or do you have better solutions? Thanks.

4). When one disk of a chunkserver is marked "damaged" in the CGI monitor, does it mean that this disk is read-only? And what causes the chunkserver to be marked "damaged"?

5). In fact, we can find the broken file from the logs, such as "/mnt/mfs/test/p/20120329/0000000c/0000027e" above. After I tried to repair this file with "mfsfilerepair", the version of the chunk "00000000000EC3A8" changed to 2, not 3. What is the difference between 00000000000EC3A8_00000002 and 00000000000EC3A8_00000003? According to the MD5 values of these two files, their contents are not the same, so is any data lost after this mfsfilerepair operation?

#mfsfileinfo /mnt/mfs/test/p/20120329/0000000c/0000027e
/mnt/mfs/test/p/20120329/0000000c/0000027e:
chunk 0: 00000000000EC181_00000001 / (id:967041 ver:1)
copy 1: 10.7.17.54:9422
copy 2: 10.7.17.85:9422
copy 3: 10.7.17.86:9422
......
chunk 6: 00000000000EC3A8_00000002 / (id:967592 ver:2)
copy 1: 10.7.17.86:9422

Thanks,
Best Wishes,
Wenhua
From: Quenten G. <QG...@on...> - 2012-03-31 01:46:12
Hi Steve,

I'm also in the middle of building a similar configuration. Did you happen to consider using maybe FreeBSD with ZFS and a couple of smallish SSDs for logs? After thinking about it this morning, I imagine this could considerably lower the fsync speed issues.

Also, as a side note, I went through my small test cluster, which is 6 machines with 2 disks each (1RU servers), and replaced all of the disks which seemed to have higher than average fsync times, and this increased my cluster's performance considerably; I'm not currently running any RAID. I guess this may go without saying, however I thought I might mention it :)

Quenten Grasso

-----Original Message-----
From: Steve Thompson [mailto:sm...@cb...]
Sent: Saturday, 31 March 2012 4:33 AM
To: Chris Picton
Cc: moo...@li...
Subject: Re: [Moosefs-users] SSDs

[...]
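A hedged sketch of the "smallish SSDs for logs" idea: adding a mirrored ZFS intent log (SLOG) to an existing chunkserver pool so that synchronous writes, such as the chunkserver's frequent fsyncs, land on the SSDs. The pool name and FreeBSD-style device names are assumptions:

# Attach a mirrored SLOG (ZFS intent log) to an existing pool.
zpool add chunkpool log mirror ada1 ada2

# Verify the log vdev is present.
zpool status chunkpool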
From: Steve T. <sm...@cb...> - 2012-03-30 18:33:00
On Fri, 30 Mar 2012, Chris Picton wrote:
> Do those servers have battery backed cache on the raid? If so, are they set
> to write-back or write-through?
>
> I have found that when running on standard SATA disks, the constant fsync is
> what slows things down tremendously. A battery backed write-back cache on
> the raid card would help a lot there.

Yes, the controllers (Perc 5's and Perc 6's) have battery backup, and the virtual disks are set to write back. Indeed, write through (such as when the battery is doing a learn cycle) is a great deal slower.

Steve
From: Chris P. <ch...@ec...> - 2012-03-30 18:25:30
On 2012/03/30 7:41 PM, Steve Thompson wrote:
> I always use whole disks combined into RAID sets via hardware raid, so
> partition alignment is not an issue, and then use ext4 file systems as
> chunk volumes. [...]

Hi Steve,

Do those servers have battery backed cache on the raid? If so, are they set to write-back or write-through?

I have found that when running on standard SATA disks, the constant fsync is what slows things down tremendously. A battery backed write-back cache on the raid card would help a lot there.

Chris
From: Steve T. <sm...@cb...> - 2012-03-30 17:41:53
On Fri, 30 Mar 2012, Ricardo J. Barberis wrote:
> Do those drives happen to have 4 KB physical block size?
> That combined with unaligned partitions could explain such bad performance.

I always use whole disks combined into RAID sets via hardware raid, so partition alignment is not an issue, and then use ext4 file systems as chunk volumes. Raw I/O performance of the volumes is excellent, and indeed the I/O performance numbers shown by the MFS cgi script are also excellent (both read and write are greater than gigabit bandwidth). I use high-end Dell servers with 3+ GHz processors and dual bonded gigabit links for I/O, but nevertheless the resulting MFS I/O performance, which was good initially when I built a testing setup, has just plummeted with a real-world I/O load on it, to the point where I am getting a lot of complaints. This looks like a show stopper to me, which is somewhat upsetting. I'm not sure what I can do at this point to tune it further.

Steve

--
----------------------------------------------------------------------------
Steve Thompson, Cornell School of Chemical and Biomolecular Engineering
smt AT cbe DOT cornell DOT edu
"186,282 miles per second: it's not just a good idea, it's the law"
----------------------------------------------------------------------------
From: Dr. M. J. C. <mj...@av...> - 2012-03-30 16:34:26
On 03/30/2012 12:10 PM, Ricardo J. Barberis wrote:
> Wow, those are really awful numbers for the 1 TB drives.
>
> Do those drives happen to have 4 KB physical block size?
> That combined with unaligned partitions could explain such bad performance.

I'm not sure. I had wanted to move to SSDs anyway, so this was the push I needed. I didn't explore optimizing the 1 TB disks in detail.

- Mike
From: Brent A N. <br...@ph...> - 2012-03-30 16:15:32
On Fri, 30 Mar 2012, Michał Borychowski wrote:
> Hi!
>
> Error 11 means "chunk locked". It appears when several processes at
> different computers try to write to the same file in parallel.

Huh, I don't see a reason any other machines or even different mfsmounts on the same machine would have written to my google-chrome cache files at the same time. I'm only running google-chrome on one machine (indeed, it prevents me from starting another instance accidentally on a different machine at the same time, and I checked through all of our computers for stray processes). I do have a cron job running on a different machine to purge old cache files, although I think that was added after my last errors.

Would a file deletion even lock a chunk for writing, or would that be purely a metadata operation? How about an atime update (I would think that would have to be a metadata server operation and would not involve the chunks in any way)? Could the servers, in their internal maintenance schedule, be briefly locking the chunks for other reasons? I still wonder if Chrome might be tickling some corner case every now and then and producing these messages with the involvement of just one mfsmount...

> You should not be bothered by this message but it's wise to minimize
> occurrences of this situation.

Indeed, I've observed no problems resulting from this warning, apart from it generating enough complaints to fill a 1.2 GB /var after awhile. When it happens, it happens in large bursts.

By the way, so far I've seen no further warnings since disabling the google-chrome cache (which seems much faster, anyway). So far, so good.

Thanks,
Brent
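Related to Brent's workaround: rather than disabling the cache entirely, Chrome can be pointed at a local (non-MooseFS) cache directory, which keeps its lock-heavy cache I/O off the distributed filesystem. The path is an assumption:

# Keep Chrome's disk cache on local storage instead of the MFS-backed home directory.
google-chrome --disk-cache-dir=/tmp/$USER-chrome-cache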
From: Ricardo J. B. <ric...@da...> - 2012-03-30 16:10:44
On Friday 30/03/2012, Dr. Michael J. Chudobiak wrote:
> Here is a quick test I did in a live system, comparing a 600 GB SSD in
> one chunkserver with a 1TB hard drive in the other:
>
> http://i.imgur.com/J0wxz.png

Wow, those are really awful numbers for the 1 TB drives.

Do those drives happen to have 4 KB physical block size? That combined with unaligned partitions could explain such bad performance.

Check for example:
http://www.ibm.com/developerworks/linux/library/l-4kb-sector-disks/index.html?ca=dgr-lnxw074KB-Disksdth-LX

Regards,
--
Ricardo J. Barberis
Senior SysAdmin / ITI
Dattatec.com :: Soluciones de Web Hosting
Tu Hosting hecho Simple!
------------------------------------------
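A hedged way to check the two things Ricardo mentions on a chunkserver disk (the device name is an example):

# Report the drive's logical and physical sector sizes (4096 here would mean a 4K-sector drive).
cat /sys/block/sda/queue/logical_block_size
cat /sys/block/sda/queue/physical_block_size

# Check whether partition 1 is aligned to the device's optimal I/O boundaries.
parted /dev/sda align-check optimal 1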
From: Corin L. <in...@co...> - 2012-03-30 12:26:00
Hi Michal,

On 30.03.2012 13:00, Michał Borychowski wrote:
> We already thought about O_DIRECT some time ago but even Linus Torvalds
> advises against using it (citation below) and probably it would decrease the
> chunkserver performance.
>
> On the other hand it would be quite easy to implement O_DIRECT, but we would
> need some strong arguments for doing this. Can you share some with us?

My only intent is to avoid double buffering. I'm using moosefs for storing (big) vserver images only. Every physical machine in my cluster acts as a chunkserver as well as a host for some of the vservers.

I assume double buffering occurs here without using O_DIRECT for chunkservers, since the chunkservers cache the images (and the few additional bytes the chunkservers use for metadata) and the hosts (vservers) mounting the images cache them again. I assume it'd be much better in terms of performance and memory usage to have the data cached on the client side only, where the image is mounted/used. This is especially a concern when a vserver's image is on the same host as the chunkserver serving it.

Is this a real concern or am I missing something? :)

Thanks,
Corin
From: Dr. M. J. C. <mj...@av...> - 2012-03-30 11:53:44
On 03/30/2012 06:49 AM, Michał Borychowski wrote:
> Hi Michael!
>
> Do you use only SSD drives in the chunkservers? Maybe you would like to
> share some speed tests with the users on the group?
>
> And do you have SSD disk in the master server? How big is your metadata?
> Have you noticed any improvements?

Michał,

I have a small moosefs system holding ~400 GB of data, including users' home folders.

The master server always had SSD disks. metadata.mfs is only ~50 MB.

Performance was quite disappointing until I removed the two 1 TB hard drives in the chunkservers and replaced them with four 600 GB SSDs. The improvement in performance was HUGE. For a small system, they are definitely worth the cost.

Here is a quick test I did in a live system, comparing a 600 GB SSD in one chunkserver with a 1 TB hard drive in the other:

http://i.imgur.com/J0wxz.png

Both chunkservers were on similar network connections (gigabit ethernet, same switch, jumbo frames).

I think LibreOffice was having trouble with the very long fsync times reported on the hard drives, particularly when accessing ~/.libreoffice. I don't know why the fsync times were so dreadful, but the SSDs made that issue go away entirely. Perhaps I could have tweaked something with hdparm, but it was more practical to just swap in the SSDs.

The current system is excellent, robust, stable, and even works well with all those applications that use sqlite files, like Firefox and Thunderbird (troublesome on all versions of nfs).

For a future feature enhancement, you might consider allowing the admin to specify that certain folders - like /fileserver/home - be assigned to chunks on certain disks, so that home folders could go on SSDs while bulk data goes on slower disks.

- Mike
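A rough, hedged way to compare per-disk sync-write latency of the kind behind the fsync numbers discussed above (the mount point is an assumption; run it on each chunkserver data disk in turn):

# Time 1000 small synchronous writes on a chunkserver data disk.
# Each write is forced to stable storage, so the elapsed time is dominated
# by fsync/flush latency, the figure that differed so sharply between
# the hard drives and the SSDs above.
dd if=/dev/zero of=/mnt/mfschunks1/fsync-test bs=4k count=1000 oflag=dsync
rm /mnt/mfschunks1/fsync-test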