Re: [Dar-support] dar without recursion
For full, incremental, compressed and encrypted backups or archives
Brought to you by: edrusb
From: Jean-Baptiste D. <jb...@pa...> - 2022-05-22 20:13:10
Hi Per, Hi Denis,

Per, thank you for the suggestion. I tried something along those lines, without luck, before asking the list ;)

Denis, thank you very much for the detailed explanation. Your first suggestion looks perfect to me. It's elegant, and almost obvious after reading it ;) Do you consider this solution imperfect because of the creation of (empty) directories, or something else? I can live with that.

I cannot use solutions 2 and 3: I'm accessing the data over NFS, and I think only recent Linux NFS kernel servers have implemented xattr. I'm using "enterprise" NAS appliances, and I don't think they handle it today. Anyway, it's good to know how to leverage EA to achieve my goal.

> it does not fit dar internals, you can just define the size of a slice
> with one-byte accuracy, then dar will not make larger slices but will
> fill them up with the data to back up.

Understood.

> I don't see the use case of your requirement, can you develop?

Here is the simplified story:

- I've got two NAS (nas0 and nas1) that I access using NFSv3.
- I need to back up nas0 on nas1 (300 TBytes, 100+ million files), rebuild nas0, and restore its content from nas1.
- nas0 is actively used until the downtime.
- I need to minimize the downtime.

My actual strategy is to run rsync multiple times until the scheduled downtime. I'm running multiple rsync processes in parallel, from multiple servers, with the help of fpart/fpsync (https://www.fpart.org/). fpsync creates sets of files (with a maximum total size or a maximum file number, whichever comes first). I do a last synchronization pass (using fpsync) during the downtime to catch up with the last modifications, I do what I have to do, and I restore using the same tools.

It works reasonably well; I've done multiple migrations using this strategy. But I've got a lot of small files (<1 KByte), sometimes millions of files at the same level in a single directory, sometimes nicely distributed in sub-directories.
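The file-set splitting that fpsync performs can be approximated with standard tools. Here is a minimal stand-in sketch, assuming GNU coreutils; the simulated tree and the 1000-files-per-part figure are illustrative, and fpsync of course adds the scheduling on top:

```shell
#!/bin/sh
# Sketch: partition a file list into fixed-size sets, as fpart/fpsync
# do natively. The tree and part size here are made up for illustration.
set -e
src=$(mktemp -d)
for i in $(seq 1 2500); do : > "$src/file$i"; done

list=$(mktemp)
find "$src" -type f > "$list"

# At most 1000 entries per part; each part file is a candidate job,
# e.g.:  rsync -a --files-from="$part" / destination/
split -l 1000 -d "$list" "$list.part."
ls "$list.part."*
```

With 2500 files this yields three parts (1000 + 1000 + 500 entries), each of which can be handed to an independent rsync process.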
I also have big files (and sometimes weird characters in filenames ;)). Any tool using "posix" access will struggle in this situation. Splitting by number of files helps in such cases, by preventing a single rsync process from handling 1 million 1 KByte files (for a total of 1 GB).

That said, using this method I'm paying the small-files overhead twice: reading and writing those files during the backup from source to destination, and again during the restoration. dar looks like a tool I could use to cut this overhead in half: I would keep the per-file overhead during the read phase of the backup and the write phase of the restore, while all other steps would deal with dar slices (writing "big" slices during backup and reading them during restore). dar also allows me to keep the iterative backup passes before the scheduled downtime.

I'm just exploring ideas at the moment, and I do not have a definitive idea of how to make it work. Hence my questions about recursion, limiting the number of files per slice, and parallelism. The devil is in the details of course, but your suggestion of a single-slice backup, maybe per directory, without recursion, looks like a good starting point to play with.

Thank you again for your initial answer!

Jean-Baptiste

On 5/21/22 21:27, Denis Corbin wrote:
> On 21/05/2022 at 20:24, Per Jensen wrote:
>> Hi,
>
> Hi Per, Hi Jean-Baptiste,
>
>>
>> would something like: dar -c file-backup -R directory -I "*" -P "*"
>> work ?
>
> it will "work" (dar would not complain) but it will not solve the problem.
>
>>
>> where
>>
>> -I "*" selects all files
>>
>> -P "*" ignores all directories
>
> The -P and -g options are applied to all entries, directories *and* files,
> unlike -I and -X, which only apply to files. Both are evaluated independently:
> - a directory only has to pass the filtering of the -P/-g/-[/-] options,
>   applied to its whole path, for its fate to be known.
> - a file has *in addition* to satisfy the -X/-I filters, applied solely
>   to its filename.
>
> Here, with your proposal, you will get an empty backup, as -P "*" will
> filter out everything, files and directories. The -I option will not be
> applied to plain files, as they would already have been filtered out.
>
>>
>> Regards Per
>>
>>
>> On 21.05.2022 at 15.18, Jean-Baptiste Denis wrote:
>>> Hello,
>>>
>>> I've got a directory with an awful lot of files beneath it, at a
>>> single level. There is also a number of directories
>>> that I don't know in advance.
>>>
>>> directory/
>>> ├── dir00
>>> ├── dir01
>>> ├── dir02
>>> ├── dir03
>>> ├── dir04
>>> ├── dir05
>>> ├── dir06
>>> ├── dir07
>>> ├── dir08
>>> ├── dir09
>>> ├── dir10
>>> ├── file0000000
>>> ├── file0000001
>>> ├── file0000002
>>> ├── file0000003
>>> ├── file0000004
>>> [...]
>>> ├── file0999997
>>> ├── file0999998
>>> ├── file0999999
>>> └── file1000000
>>>
>>> I'd like to use dar on "directory" without considering all its
>>> subdirectories. From the documentation, it is not clear
>>> *to me* whether dar can do this without some prework:
>>> a first pass to spot the directories, using find or equivalent,
>>> and a second pass running dar with those directories excluded?
>
> 1/ This is not perfect, but you can use this first approach (I assumed
> "directory/" was located at the root of the filesystem):
>
> dar -c backup -R / -g directory -P "directory/*/*" ... other options...
>
> This will still save the dir* directories themselves, but not their content.
>
> 2/ Else, if your filesystem supports EA, I mean has "user_xattr" added
> to its mount options in /etc/fstab (you can also set it up live using):
>
> mount -o remount,user_xattr /
>
> Then, a simple thing would be to add an Extended Attribute to all
> directories under /directory and use the --exclude-by-ea option:
>
> find /directory -mindepth 1 -type d -exec setfattr -n user.libdar_no_backup {} \;
> dar -c backup -R / -g directory --exclude-by-ea ... other options...
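To see why approach 1 keeps the dir* entries but drops their content, the prune pattern can be emulated with shell case patterns. This is only a rough stand-in (in case patterns `*` also crosses `/`, which may differ from dar's matching in corner cases), but it reproduces the behaviour on the examples above:

```shell
#!/bin/sh
# Emulate the effect of -P "directory/*/*": anything at least two
# levels below "directory" is pruned; one-level entries survive.
classify() {
  case "$1" in
    directory/*/*) echo "excluded: $1" ;;
    *)             echo "kept: $1" ;;
  esac
}
classify directory/file0000000     # kept (plain file, one level deep)
classify directory/dir00           # kept (the directory entry itself)
classify directory/dir00/somefile  # excluded (content of dir00)
```

The dir* entries are kept because their paths contain only one slash, so they never match the two-level prune pattern; everything beneath them does match and is excluded.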
>
> 3/ You can also do the same without EA, by using the dump flag (if
> supported by the filesystem):
>
> find /directory -mindepth 1 -type d -exec chattr +d {} \;
> dar -c backup -R / -g directory --nodump ... other options...
>
> 4/ Last, if you do not want to or cannot touch the filesystem under
> backup, you have to list the directories to be excluded and provide the
> list to dar:
>
> find /directory -mindepth 1 -type d > /tmp/dirlist.txt
> dar -c backup -R / -g directory --exclude-from-file /tmp/dirlist.txt
>
> I have no better option so far.
>
>>>
>>> I'm hijacking my own thread with some side questions related somehow
>>> to my initial question:
>>>
>>> 1. I'd like to have slices up to a certain size OR containing at
>>> most N files, whichever comes first. I don't know
>>> if this would fit with dar internals. If it does, it could be
>>> a nice new option.
>
> it does not fit dar internals, you can just define the size of a slice
> with one-byte accuracy, then dar will not make larger slices but will
> fill them up with the data to back up. I don't see the use case of your
> requirement, can you develop?
>
>>>
>>> 2. Is there a way to handle each slice independently (by running
>>> multiple dar commands, for example) when extracting, or
>>> does it not make sense? I could imagine dar slices stored on an NFS
>>> server and multiple clients using dar (from an
>>> external script) on different slices in parallel, potentially
>>> leveraging a datacenter/HPC network and a
>>> parallel/distributed filesystem.
>
> Parallelism works better with independent data sets. I would thus
> suggest making many independent single-sliced backups with dar and
> reading them with as many concurrent dar commands. That said, having a
> single dar backup does not prevent many different dar commands from
> reading it and extracting data from it at the same time: dar does not
> use any temporary files and does not touch backups when restoring data
> from them.
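Approach 4 combines naturally with the "many independent single-sliced backups" idea. Below is a sketch assuming GNU find/xargs; the sample tree is hypothetical, 4 concurrent jobs is an arbitrary choice, and the dar invocation only runs if dar is installed:

```shell
#!/bin/sh
# Build the exclusion list of approach 4, then launch one dar archive
# per top-level subdirectory, 4 at a time: independent archives can be
# written -- and later restored -- fully in parallel.
set -e
root=$(mktemp -d)
mkdir -p "$root/directory/dir00" "$root/directory/dir01"
: > "$root/directory/file0000000"
: > "$root/directory/file0000001"

# -mindepth 1 keeps "directory" itself out of the exclusion list.
find "$root/directory" -mindepth 1 -type d > /tmp/dirlist.txt
cat /tmp/dirlist.txt

if command -v dar >/dev/null 2>&1; then
  # One single-sliced archive per subdirectory, run concurrently.
  find "$root/directory" -mindepth 1 -maxdepth 1 -type d -print0 |
    xargs -0 -P 4 -I{} dar -c "{}-backup" -R "{}"
fi
```

Each subdirectory archive is self-contained, so the restore side can likewise run one dar extraction per archive on as many client hosts as are available.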
>
> But note also that for extracting data from a dar backup, there are FUSE
> and AVFS clients [1]. These would leverage kernel VFS caching, and if
> you "mount" a dar backup over NFS, the caching will be local to each NFS
> client host, which may avoid network data transfers for data often
> requested on a particular host.
>
> [1] http://dar.linux.free.fr/doc/presentation.html#external_tools
> (well, I don't know the status of these projects; contact their authors
> directly if you need to).
>
>>>
>>> Thank you !
>>>
>>> Jean-Baptiste
>>>
>>
>
> Cheers,
> Denis