From: Quenten G. <QG...@on...> - 2012-04-04 07:34:42
Hi Ben, Moose users,

Thanks for your reply. I've been thinking about using ZFS; however, as I understand it, the benefits of ZFS worth leveraging are data corruption prevention (i.e. checksumming of data via scrubs) and compression. As I understand it, MFS has had checksumming built in for a while now, from the MFS FUSE mount (across the network) all the way down to the disk level, so whenever we access data it is checksummed, which in itself is great. This means we don't "need" RAID controllers for data protection, and if we use a goal of 2 or more we get redundancy and data protection for little extra space.

I've done some basic maths on using ZFS, for example 4 chunkservers with 8 x 2 TB drives each, comparing ZFS RAIDZ2 with a goal of 2 against single disks with a goal of 3:

ZFS RAIDZ2 + goal 2 = 24 TB usable
Single disks + goal 3 = 21 TB usable

So clearly there is a space saving here of around 3 TB using ZFS...

Reliability: with the ZFS configuration, if any more than 1 physical server, or any more than 5 disks within 2 chassis, fail at the same time before being replaced, our cluster is offline. Versus goal: with a goal of 3, if any more than 2 disks holding the same data set across the total number of servers, or any more than 2 physical servers, fail at any one time, our cluster is effectively offline as well; keeping in mind that the chances of this happening would have to be pretty low as you increase your number of servers and drives.

Speed: the raw speed of a single SATA disk is around 75 IOPS and around 100 MB/s throughput. With RAIDZ2 I imagine we would achieve the speed of 6 of the 8 disks, i.e. about 450 IOPS or 600 MB/s per server. With a goal of 3, we would achieve a write of 75 IOPS and ~100 MB/s per server. For single threads I think the ZFS system should certainly have higher throughput; with multiple threads, however, I think the multiple paths in and out of a goal of 3 would win.

At this stage it always seems like a trade-off: reliability or performance, pick one? Reviewing these examples, the middle solution would be RAIDZ1 with a goal of 3; that is the closest we could get to both performance and redundancy...

This changes again when we look at scale. Let's expand to 40-80 servers. With 40 servers using RAIDZ2, a single volume and a goal of 3, which 2 of the 40 servers could fail at any one time without me losing access to any data? The chunks are effectively "randomly" placed among the cluster, so I guess we would need to increase the overall goal, once again trading space usage for reliability. For a non-RAID/ZFS setup of 40 servers / 320 hard disks, 3 of which have my data on them, which 2 can fail without me losing access to my data? :)

So I guess this raises a few more questions about which solution is the most effective. In the case of the ZFS RAIDZ2/RAIDZ1 solutions, what becomes the acceptable ratio of servers to goal from a reliability point of view? Or, using individual disks plus goal, does scaling up the number of servers give us an increase in performance at the cost of reliability? Also, from a performance point of view, the higher the goal the more throughput; however, this may work against us if the cluster is "very busy" across all of the servers.

So I guess we are back to where we started: we still have to pick one, performance or reliability? Any thoughts? Also thanks for reading, if you made it :)

Regards,
Quenten Grasso
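For reference, the capacity and throughput figures above work out as in the following back-of-the-envelope sketch. It assumes the same 4 x 8 x 2 TB layout and the 75 IOPS / 100 MB/s single-disk estimates from the post, and ignores ZFS metadata and filesystem overhead:

    # Rough capacity/throughput comparison of the two layouts discussed above.
    # Assumed, not measured: 2 TB drives, 75 IOPS and 100 MB/s per disk,
    # RAIDZ2 giving up 2 of 8 drives to parity. Filesystem overhead ignored.
    SERVERS = 4
    DISKS_PER_SERVER = 8
    TB_PER_DISK = 2
    DISK_IOPS = 75
    DISK_MBPS = 100

    def usable_tb(data_disks_per_server, goal):
        """Raw data capacity across the cluster divided by the MFS goal."""
        return SERVERS * data_disks_per_server * TB_PER_DISK / goal

    # Layout 1: RAIDZ2 per chunkserver (6 of 8 disks carry data), MFS goal 2
    raidz2_goal2 = usable_tb(data_disks_per_server=6, goal=2)
    # Layout 2: individual disks (all 8 carry data), MFS goal 3
    single_goal3 = usable_tb(data_disks_per_server=8, goal=3)

    print(f"RAIDZ2 + goal 2      : {raidz2_goal2:.1f} TB usable")   # ~24 TB
    print(f"single disks + goal 3: {single_goal3:.1f} TB usable")   # ~21.3 TB

    # Very rough per-server streaming estimate (striping across the data disks):
    print(f"RAIDZ2 per server    : ~{6 * DISK_IOPS} IOPS, ~{6 * DISK_MBPS} MB/s")
    print(f"single disk per copy : ~{DISK_IOPS} IOPS, ~{DISK_MBPS} MB/s")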
-----Original Message-----
From: Allen, Benjamin S [mailto:bs...@la...]
Sent: Wednesday, 4 April 2012 9:13 AM
To: Quenten Grasso
Cc: moo...@li...
Subject: Re: [Moosefs-users] Backup strategies

Quenten,

I'm using MFS with ZFS. I use ZFS for RAIDZ2 (RAID6) and hot sparing on each chunkserver, and then only set a goal of 2 in MFS. I also have a "scratch" directory within MFS that is set to goal 1 and not backed up to tape. I try to get my users to organize their data between their data directory and scratch, to minimize goal overhead for data that doesn't require it.

The overhead of my particular ZFS setup is ~15% lost to parity and hot spares, although I was a bit bold with my RAIDZ2 configuration, which will make rebuild times quite long in trade-off for the lower overhead. This was done with the knowledge that RAIDZ2 can withstand two drive failures, and that MFS would have another copy of the data on another chunkserver. I have not, however, tested how well MFS handles a ZFS pool degraded with data loss. I'm guessing I would take the chunkserver daemon offline, get the ZFS pool into a rebuilding state, and restart the CS. I'm guessing the CS will see missing chunks, mark them undergoal, and re-replicate them. A more cautious RAID set would be closer to 30% overhead. Then of course with goal 2 you lose another 50%.

Side benefits of using ZFS are on-the-fly compression and de-dup on your chunkservers, an L2ARC SSD read cache (although it turns out most of my cache hits are from L1ARC, i.e. memory), and, to speed up writes, the option of adding a pair of ZIL SSDs.

For disaster recovery you always need to be extra careful when relying on a single system to do both your live and DR sites. In this case you're asking MFS to push data to another site. You'd then be relying on a single piece of software that could equally corrupt your live site and your DR site.

Ben

On Apr 3, 2012, at 3:36 PM, Quenten Grasso wrote:

> Hi All,
>
> How large are your metadata & logs at this stage? Just trying to mitigate this exact issue myself.
>
> I was planning to create hourly snapshots (as I understand the way they are implemented, they don't affect performance, unlike a VMware snapshot; please correct me if I'm wrong) and copy these off-site to another MFS cluster using rsync with snapshots, with maybe a goal of 2 at most on the other site and a goal of 3 on site.
>
> I guess the big issue here is storing our data 5 times in total vs. tapes; however, I guess it would be "quicker" to recover from a "failure" with a running cluster on site B vs. a tape backup, and, dare I say it, (possibly) more reliable than a single tape and tape library.
>
> Also I've been tossing up the idea of using ZFS for storage. The reason I say this is because I know MFS has checksumming built in, like ZFS, and all that good stuff; however, having to store our data 3 times + 2 times is expensive. Maybe storing it 2 + 1 instead would work out at scale, using the likes of ZFS for reliability and then using MFS purely for availability, instead of for reliability & availability as well...
>
> It would be great if there were a way to use some kind of rack awareness to say "at all times keep a goal of 1 or 2 of the data off-site on our 2nd MFS cluster". When I was speaking to one of the staff of the MFS support team, they mentioned this was kind of being developed for another customer, so we may see some kind of solution?
>
> Quenten
>
> -----Original Message-----
> From: Allen, Benjamin S [mailto:bs...@la...]
> Sent: Wednesday, 4 April 2012 7:17 AM
> To: moo...@li...
> Subject: Re: [Moosefs-users] Backup strategies
>
> Similar plan here.
>
> I have a dedicated server for MFS backup purposes. We're using IBM's Tivoli to push to a large GPFS archive system backed with a SpectraLogic tape library. I have the standard Linux Tivoli client running on this host. One key with Tivoli is to use DiskCacheMethod, and to set the disk cache to be somewhere on local disk instead of the root of the MFS mount.
>
> Also, I back up mfsmaster's files every hour and retain at least a week of these backups. Of the various horror stories we've heard on this mailing list, all have been from corrupt metadata files from mfsmaster. It's a really good idea to limit your exposure to this.
>
> For good measure I also back up metalogger's files every night.
>
> One dream for backup of MFS is to somehow utilize the metadata files dumped by mfsmaster or metalogger to do a metadata "diff". The goal of this process would be to produce a list of all objects in the filesystem that have changed between two metadata.mfs.back files. You could then feed your backup client a list of files, without the client needing to inspect the filesystem itself. This idea is inspired by ZFS's diff functionality, where ZFS can show the changes between a snapshot and the live filesystem.
>
> Ben
>
> On Apr 3, 2012, at 2:18 PM, Atom Powers wrote:
>
>> I've been thinking about this for a while and I think Occam's razor (the
>> simplest idea is the best) might provide some guidance.
>>
>> MooseFS is fault-tolerant, so you can mitigate "hardware failure".
>> MooseFS provides a trash space, so you can mitigate "accidental
>> deletion" events.
>> MooseFS provides snapshots, so you can mitigate "corruption" events.
>>
>> The remaining scenario, "somebody stashes a nuclear warhead in the
>> locker room", requires off-site backup. If "rack awareness" were able to
>> guarantee chunks in multiple locations, then that would mitigate this
>> event. Since it can't, I'm going to be sending data off-site using a
>> large LTO5 tape library managed by Bacula, on a server that also runs
>> an mfsmount of the entire system.
>>
>> On 04/03/2012 12:56 PM, Steve Thompson wrote:
>>> OK, so now you have a nice and shiny and absolutely massive MooseFS file
>>> system. How do you back it up?
>>>
>>> I am using Bacula and divide the MFS file system into separate areas (e.g.
>>> directories beginning with a, those beginning with b, and so on) and use
>>> several different chunkservers to run the backup jobs, on the theory that
>>> at least some of the data is local to the backup process. But this still
>>> leaves the vast majority of data to travel the network twice (a planned
>>> dedicated storage network has not yet been implemented). This results in
>>> pretty bad backup performance and high network load. Any clever ideas?
>>>
>>> Steve
>>
>> --
>> Perfection is just a word I use occasionally with mustard.
>> --Atom Powers--
>> Director of IT
>> DigiPen Institute of Technology
>> +1 (425) 895-4443
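As an aside on Steve's approach of splitting the filesystem into separate backup areas run from different chunkservers: below is a minimal sketch of the partitioning step only, under made-up assumptions. The /mnt/mfs mount point and the host names are hypothetical, this is not Bacula configuration, and it buckets top-level directories by a hash rather than by literal a/b/c ranges purely to keep the buckets roughly even. Each bucket would become the file list for one backup job run on that host.

    #!/usr/bin/env python3
    """Bucket top-level MFS directories across several backup hosts.

    Sketch of the "split the filesystem into separate backup jobs" idea.
    The mount point and host names are made up; this only assigns
    directories to jobs, it does not drive any backup software.
    """
    import os
    import zlib

    MFS_ROOT = "/mnt/mfs"                    # hypothetical mfsmount location
    BACKUP_HOSTS = ["cs01", "cs02", "cs03"]  # hypothetical chunkserver names

    def bucket_for(name: str) -> str:
        """Stable assignment of a directory name to a backup host."""
        return BACKUP_HOSTS[zlib.crc32(name.encode("utf-8")) % len(BACKUP_HOSTS)]

    def build_job_lists(root: str) -> dict:
        jobs = {host: [] for host in BACKUP_HOSTS}
        for entry in sorted(os.listdir(root)):
            path = os.path.join(root, entry)
            if os.path.isdir(path):
                jobs[bucket_for(entry)].append(path)
        return jobs

    if __name__ == "__main__":
        for host, paths in build_job_lists(MFS_ROOT).items():
            print(f"{host}: {len(paths)} top-level directories")
            for p in paths:
                print(f"  {p}")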
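On Ben's metadata "diff" idea: metadata.mfs.back is a binary file, so a real implementation would need to parse it (or rely on some dump tool), but the comparison step itself is straightforward. The sketch below assumes you already have two plain-text listings, one line per object in the form path<TAB>mtime<TAB>size, taken at two points in time; it prints the paths that are new or have changed, which could then be fed to a backup client.

    #!/usr/bin/env python3
    """Emit files that changed between two filesystem listings.

    Sketch of the metadata-"diff" idea from the thread. Assumes two text
    listings (path<TAB>mtime<TAB>size per line) taken at two points in
    time; how those listings are produced from metadata.mfs.back is left
    open.
    """
    import sys

    def load(listing_path: str) -> dict:
        """Map each path to its (mtime, size) pair."""
        entries = {}
        with open(listing_path, encoding="utf-8") as fh:
            for line in fh:
                path, mtime, size = line.rstrip("\n").split("\t")
                entries[path] = (mtime, size)
        return entries

    def changed(old: dict, new: dict):
        """Yield paths that are new or whose mtime/size differ."""
        for path, meta in new.items():
            if old.get(path) != meta:
                yield path

    if __name__ == "__main__":
        old_listing, new_listing = sys.argv[1], sys.argv[2]
        for path in changed(load(old_listing), load(new_listing)):
            print(path)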